Preface


Data is changing the way society and technology evolve. With the advent of IoT, Big Data, ML and AI, rapid development of technology towards more human-centric applications is envisaged. The finance and insurance sectors are no exception, and developments in FinTech and InsuranceTech are in a phase of developing unique offerings.

It is very important to have a common understanding of the actual conditions in the financial and insurance sectors and of how technology can help to advance and evolve those conditions in a positive manner. Discussing the principles of the modern economy that make the modern financial sector and FinTech the most disruptive areas in today's global economy builds that understanding and knowledge.

Data-driven approaches open many opportunities for activating new channels of innovation on the local and global scale, while at the same time catapulting opportunities for more disruptive human-centric services. Data-driven human-centric applications are also the result of a shared vision arising from the natural evolution of technology and society. Experts in the financial and insurance sectors are seeing a dramatic change in how people think about the global economy, while technology provides the instruments for new ways of understanding, providing a common vision and identifying impacts in finance and insurance.

The INFINITECH book series addresses the need for clear information that gives technical and non-technical experts who participate in the financial and insurance process a better understanding of the foundations, principles and technologies. It also responds to the constant need for innovation and new services across banks and insurance organizations.
Who Should Read This Book?


Financial & Insurance Regulators 


The unique offering for non-technical experts who participate in the financial regulatory process, and the core service that enables the sharing of innovation and new services across banks and insurance organizations without exchanging any customer data.


General Public & Students 


The power of understanding the future of FinTechs, their services and their ability to identify different methodologies and indicators from a human perspective.


Entrepreneurs and SMEs 


The most powerful tools to innovate, increase opportunities and bring the power of innovation to small businesses and entrepreneurs, so that it meets its full potential when there is good participation across the banking and insurance sectors.


Technical Experts & Software Developers 


A guidebook to technologies and to legacy open-source and non-open-source software, including the most recent experiences in Europe in innovating technology for the financial and banking sectors.
What is Addressed in the Book Series?


"Concepts and Design Thinking Innovation addressing the Global Financial Needs"


In the first part of the INFINITECH book series we begin by discussing the principles of the modern economy that make the modern financial sector and FinTech the most disruptive areas in today's global economy. INFINITECH envisions many opportunities emerging for activating new channels of innovation on the local and global scale, while at the same time catapulting opportunities for more disruptive user-centric services. INFINITECH is also the result of a shared vision from a representative global group of experts, providing a common vision and identifying impacts in the financial and insurance sectors.


"Methods and Design Principles for Financial Innovation, Explaining the Supply Side for Interoperability in Finance- and Insurance-Tech"


In the second part of the series we review the basic concepts of FinTech, referring to the diversity in the use of technology to underpin the delivery of financial services. The demand and supply sides of the financial sector are presented, and we further discuss why FinTech is the focus of industry nowadays and what the waves of digitization mean. Financial technology (FinTech) and insurance technology (InsuranceTech) are rapidly transforming the financial and insurance services industry. We provide an overview of the Reference Architecture (RA) for BigData, IoT and AI applications in the financial and insurance sectors (INFINITECH-RA). Moreover, this book reviews the concept of innovation and its application in INFINITECH, and presents practical examples of the innovative technologies provided by the project for the financial sector.


"Technical Financial Innovation, Solving the Interoperability Problems of Europe"


The third book begins by providing a definition of FinTech: the use of technology to underpin the delivery of financial services. The book further discusses why FinTech is the focus of industry nowadays, the waves of digitization, and the way financial technology (FinTech) and insurance technology (InsuranceTech) are rapidly transforming the financial and insurance services industry. It introduces the technology assets that follow the Reference Architecture (RA) for BigData, IoT and AI applications. Moreover, the series of assets includes the domain areas in which applications from the INFINITECH innovation project, and the concept of innovation for the financial sector, are described. Further, we describe the INFINITECH Marketplace and its components, including details of the available assets. Next, we provide descriptions of the solutions developed in INFINITECH.


What is Covered in this INFINITECH Part II Book?


"Methods and Design Principles for Financial Innovation, Explaining the Supply Side for Interoperability in Finance- and Insurance-Tech"


In this second part of the series we review the basic concepts of FinTech, referring to the diversity in the use of technology to underpin the delivery of financial services.

The demand and supply sides of the financial sector are presented, and we further discuss why FinTech is the focus of industry nowadays and what the waves of digitization mean.

Financial technology (FinTech) and insurance technology (InsuranceTech) are rapidly transforming the financial and insurance services industry.

We provide an overview of the Reference Architecture (RA) for BigData, IoT and AI applications in the financial and insurance sectors (INFINITECH-RA). Moreover, this book reviews the concept of innovation and its application in INFINITECH, and presents practical examples of the innovative technologies provided by the project for the financial sector.


Acknowledgements 


To our families, for their incomparable affection and jollity, for always understanding that a scientific career is not a job but a lifestyle, for encouraging us to be creative, for their enormous patience during the time spent away from them, invested in our scientific endeavours and responsibilities, and for their understanding of how much we love our professional life and of its consequences: we love you!

To all our friends and relatives, for their understanding when we have no time to spend with them or cannot join them in time because we are at a conference or attending yet another meeting, and for the attention and interest they have shown all this time to keep our friendship alive: be sure our sacrifices are well rewarded.

To all our colleagues, staff members and students at our respective institutions, organisations and companies, for patiently listening with apparent attention to the descriptions and progress of our work, for the great experiences and the great time spent working together with us, and for the contributions provided to complete this book series project. In particular, thanks for the support and confidence of all the people who believed this series of books would be finished in time, and also to those who did not trust it would be, because thanks to them we were more motivated to complete the project.

To the scientific community, our family when we are away and working far from our loved ones, for their incomparable affection and loyalty, for always encouraging us to be creative, and for their enormous patience during the time invested in understanding, presenting and providing feedback on new concepts and ideas: sincerely, to you all, thanks a million!


Martín Serrano on Behalf of All Authors 




Contributing Authors 


Achille Zappa 
NUIG-Insight, Ireland 


Adrien Besse 
FTS, France 


Aikaterini Karamargiou 
NBG, Greece 


Akshay Shetty 
PRIVE, Germany 


Alain Vailati 
HPE, Italy 


Alberto Crespo 
ATOS, Spain 


Alberto Danese 
NEXI, Italy 


Aleksandra Cargo 
BOS, Slovenia 


Alessandra Forlano 
ENG, Italy 


Alessandro Amicone 


GFT, Italy 


Alessandro Mamelli 
HPE, Italy 


Alessio Del Soldato 
NEXI, Italy 


Alex Acquier 
NUIG-Insight, Ireland 


Alexander Kostopoulos 
RB, UK 


Alper Sen 
BOUN, Turkey 


Andrea Becerra 
CTAG, Spain 


Andrea Grillo 
PI, Greece 


Andrea Toro 
HPE, Italy 


Andreas Politis 
DYN, Greece 
Angeliki Kitsiou 
CP, Greece 

Anja Rijavec Ursej 
BOS, Slovenia 


Ann Smith 
BPFI, Ireland 




Anna Semeniuk 
PRIVE, Austria 


Annalisa Ceccarelli 
PI, Greece 


Anne Elisabeth Lenel 
ORT, France 


Antonis Litke 
INNOV, Cyprus 


Antonis Skarpelis 
CP, Greece 


Ariana Polyviou 
INNOV, Cyprus 


Aristodemos Pnevmatikakis 
ISPRINT, Belgium 

Baran Kilic 

BOUN, Turkey 


Barbara Cacciamani 
ABILAB, Italy 


Bardia Khorsand 
NUIG-Insight, Ireland 


Beatrice Paolone 
PI, Greece 


Bjoern Torkar 
PRIVE, Germany 


Borja Pintos Castro 
GRAD, Spain 


Brigitte Benerink 
RRD, Netherlands


Bruno Almeida 
UNP, Portugal




Bruno Lepri 
FBK, Italy 


Can Ozturan 
BOUN, Turkey 


Candeago Candeago 
FBK, Italy 


Carlos Albo 
WEA, Spain 


Carmen Furquet 
INSO, Spain 


Carmen Perea 


ATOS, Spain 


Chi Hung Le 
NUIG-Insight, Ireland 


Christian Hanley 
PRIVE, Austria 


Christiane Grunloh 
RRD, Netherlands


Christina Katsikari 
RB, UK 


Christoforos Symvoulidis 


SILO, Greece 


Christopher Genillard 
GEN, Germany 


Chrysostomos Symvoulidis 
UPRC, Greece 


Claudia Amador 
UNP, Portugal


Claudia Mertinger 
FTSG, Germany 


Craig Macdonald 
GLA, UK 




Cyril Armange 
FI, France 


Danae Lekka 
ISPRINT, Belgium 


Dario Francés 

WEA, Spain 

David Delgado 

WEA, Spain 

Davide Dalle Carbonare 
ENG, Italy 


Davide Profeta 
ENG, Italy 


Dejan Adamic 
BOS, Slovenia 
Diego Burgos 
LXS, Spain 


Dimitrios Kotios 
UPRC, Greece 


Dimitrios Miltiadou 
UBI, Greece 


Dimitris Drakoulis 
INNOV, Cyprus 


Dimitris Dres 
INNOV, Cyprus 


Dimosthenis Kyriazis 
UPRC, Greece 


Diogo Inácio 
UNP, Portugal 


Domenico Costantino 
HPE, Italy 


Domenico Messina 
ENG, Italy 




Dominik Hedderich 
GEN, Germany 


Dominique Faessel 
ORT, France 


Dustin Ciccardini 
JRC, Germany 


Ehsan Arefifar 
CTAG, Spain 


Elena Battistini 
GFT, Italy 


Elena Femenia 
INSO, Spain 


Eleni Mavrogalou 
CP, Greece 


Eleonora Ascolani 
PI, Greece 


Eoin Jordan 
NUIG-Insight, Ireland 


Erdem Oguz 
AKTIF, Turkey


Ernesto Troiano 


GFT, Italy 


Eva Sotos Martinez 


GRAD, Spain 


Evelina Peristeri 


SILO, Greece 


Eymard Hooper 
BOI, Ireland 


Fabiana Fournier 
IBM, Israel 


Fabio Dezi 
NEXI, Italy 




Fabio Magrassi 
GFT, Italy 


Farid Meinkohn 
ORT, France 


Fethi Ata 
AKTIF, Turkey
Filip Koprivec 
JSI, Slovenia 


Filipa Sousa 
UNP, Portugal


Gabriele Gamberi 
ABILAB, Italy 


Gabriele Santin 
FBK, Italy


Gary Thompson 
BOI, Ireland 


Gavin Purtill 
BPFI, Ireland 


George Fatouros 
INNOV, Greece 
George Giaglis 
UNIC, Cyprus 
George Karamanolis 
CP, Greece 

Georgios Makridis 
UPRC, Greece 


German Herrero 


ATOS, Spain 


Giacomo Toselli 
SIA, Italy 


Giancarlo Sfolcini 
SIA, Italy 




Giorgia Gazzarata 
GFT, Italy 


Giorgio Dabormida 
GFT, Italy 


Giorgio Roffo 
GLA, UK 


Giovanni Di Orio 
NOVA, Portugal 


Gisela Sanchez 
FI, France 


Giuseppe Avigliano 
PI, Greece 


Gokcehan Kara 
BOUN, Turkey 


Gregor Krzmanc 
JSI, Slovenia 


Gregor Zunic 
JSI, Slovenia 


Grigoris Mygdakos 
AGRO, Greece 


Guilherme De Brito 
NOVA, Portugal 


Harm op den Akker 
ISPRINT, Belgium 


Harm Opdenakker 
RRD, Netherlands


Hermie Hermens 


RRD, Netherlands


Hoan Nguyen 
NUIG-Insight, Ireland 


Iacopo De-angelis 
PI, Greece 




Iadh Ounis 
GLA, UK 


Iago Abad Fernandez 
GRAD, Spain 

Ian Godfrey 

FTS, France 


Ian Shiundu 
RB, UK 

Ignacio Elicegui 
ATOS, Spain 


Ilesh Dattani 
ASSEN, Ireland 


Ines Ortega-Fernandez 
GRAD, Spain 

Inna Skarbovsky 

IBM, Israel 


Irene Zattarin 


GFT, Italy 


Javier Rodriguez Viñas 


GRAD, Spain 


Javier Sanz-Cruzado Puig 
GLA, UK 


Javier Yepez Martínez 


GRAD, Spain 


Jelena Milosevic 
BOS, Slovenia 


John Soldatos 
INNOV, Cyprus 


Jonathan Gay 
ASSEN, Ireland 


Juan Mahilo 
LXS, Spain 




Juergen Neises 


FTSG, Germany 


Julian Schillinger 
PRIVE, Germany 


Julien Mousset 
PRIVE, Germany 


Klaudija Jurkosek-Seitl 
BOS, Slovenia 


Klaus Brisch 
DWE, Germany 


Klemen Kenda 
JSI, Slovenia 


Konstantina Kostopoulou 
ISPRINT, Belgium 


Konstantina Tripodi 
JRC, Germany 


Konstantina Zafeiri 
NBG, Greece 


Kostas Perakis 
UBI, Greece 


Lambis Dionysopoulos 
UNIC, Cyprus 


Lena Neidhardt 
GEN, Germany 


Lex Vanvelsen 
RRD, Netherlands


Lilian Adkinson Orellana 
GRAD, Spain 


Luca Latella 
NEXI, Italy 


Lucile Aniksztejn 
FI, France 




Lukas Linden 
GEN, Germany 


Maanasa Srikrishna 
GLA, UK 


Machi Simeonidou 


AGRO, Greece 


Mads Tingsgard 
CPH, Denmark 


Magdalena Schmid 
GEN, Germany 
Maja Skrjanc 

JSI, Slovenia 


Manolis Syllignakis 
NBG, Greece 


Manuela Masci 
PI, Greece 


Marc Meerkamp 
DWE, Germany 


Marcio Mateus 
UNP, Portugal


Marco Avallone 
PI, Greece 


Marco Crabu 
ABILAB, Italy 


Marco Muller Terjung 
DWE, Germany 


Marco Pistore 
FBK, Italy 


Marco Rotoloni 
ABILAB, Italy 


Marco Spallaccini 
PI, Greece 


Marcos Alvarez Diaz 


GRAD, Spain 


Marcos Cabeza 


CTAG, Spain 


Margarita Khokhlova 
FTS, France 


Maria José Poveda 
WEA, Spain 
Maria de Vries 
GEN, Germany 


Maria Smyth 
NUIG-Insight, Ireland 


Marian Hurmuz 
RRD, Netherlands


Marianna Charalambous 
UNIC, Cyprus 


Mariarosaria Russo 
ENG, Italy 


Marina Cugurra 
GFT, Italy 


Marina Rodriguez Hidalgo 
LIB, Spain 


Marinos Xynarianos 
CP, Greece 


Mario Maawad Marcos 
CXB, Spain 


Mario Trinchera 
ABILAB, Italy 


Marko Grobelnik 
JSI, Slovenia 


Marta Sestelo 
GRAD, Spain 




Martin J. Serrano Orozco 
NUIG-Insight, Ireland 


Massimiliano Aschi 
PI, Greece 


Massimiliano Hocevar 
PI, Greece 


Matej Koletnik 
BOS, Slovenia 


Matteo Falsetta 
GFT, Italy 


Matteo Gerosa 
FBK, Italy 


Maurizio Ferraris 


GFT, Italy 


Maurizio Megliola 
GFT, Italy 


Maximilien Nayaradou 
FI, France 


Michael Concannon 
BPFI, Ireland 


Michael Michalakoukos 
DYN, Greece 


Michael Psalidas 
CP, Greece 


Misu Helal Ali 
FTS, France 
Mitja Jermol 
JSI, Slovenia 


Mojca Trstenjak 
BOS, Slovenia 




Nadia Roberti 
PI, Greece 


Napoleon Liontos 
CP, Greece 


Neil Giles 
TAH, UK 


Nial O’Brolchain 
NUIG-Insight, Ireland 


Niarchos Vasilios 
NBG, Greece 


Nicola Masi 
ENG, Italy 


Nikolaos Kapsoulis 
INNOV, Cyprus 


Nikos Drosos 
SILO, Greece 


Nikos Droukas 
NBG, Greece 


Nuria Ituarte Aranda 
ATOS, Spain 


Oliver Sjastedt 
CPH, Denmark 


Omerbora Zeybek 
AKTIF, Turkey


Orkan Metin 
AKTIF, Turkey


Pablo Carballo 
PRIVE, Germany 


Padraig Flannery 
BOI, Ireland 


Palmira Aldeguar 
LIB, Spain 




Paolo Testa 
NEXI, Italy 


Patrick Karlsson 
RB, UK 


Patrizio Sangermano 
PI, Greece 


Paul Lefrere 
CCA, France 


Pavlos Kranas 
LXS, Spain 
Pedro Malo 
NOVA, Portugal 


Perdikouri Eleni 
NBG, Greece 


Petra Ristau 
JRC, Germany 


Phil Atherton 
TAH, UK 
Prokopaki Georgia 
NBG, Greece 


Qaiser Mehmood 
NUIG-Insight, Ireland 


Raman Kazhamiakin 
FBK, Italy 


Ramon Martin de Pozuelo 
CXB, Spain 


Rebeca Jiménez 
WEA, Spain 


Rene Danzinger 
PRIVE, Austria 


Ricard Bruguera 
WEA, Spain 




Ricardo Jimenez-Peris 
LXS, Spain 


Richard McCreadie 
GLA, UK 


Richard Walsh 
BPFI, Ireland 


Rishabh Chandaliya 
NUIG-Insight, Ireland 


Roger Ferrandis 
WEA, Spain 


Roland Meier 
PRIVE, Austria 


Roman Benito 
LIB, Spain 


Sabina Podkriznik 
BOS, Slovenia 


Sara El Kortbi Martinez 
GRAD, Spain 


Saso Crnugelj 
BOS, Slovenia 


Silvio Walser 
BOC, Cyprus 


Simon Schou 
CPH, Denmark 


Simone Centellegher 
FBK, Italy 


Sofoklis Kyriazakos 
ISPRINT, Belgium 


Spyros Spanos 
ISPRINT, Belgium 


Stamatis Pitsios 
UBI, Greece 




Stathis Kanavos 
ISPRINT, Belgium 


Stefano Gatti 
NEXI, Italy 


Stelios Kotsopoulos 


AGRO, Greece 


Stelios Mantas 
NBG, Greece 


Stelios Pantelopoulos 
SILO, Greece 


Stephanie Jansen-kosterink 
RRD, Netherlands


Susanna Bonura 
ENG, Italy 


Tanja Zdolsek draksler 
JSI, Slovenia 


Teoman Onat 
PRIVE, Germany 


Teresa Spada 
ABILAB, Italy 


Theodoros Kotzastavros 
CP, Greece 


Theodoros Arnaoutoglou 
CP, Greece 


Thomas Diesinger 
GEN, Germany 




Thomas Krogh 
CPH, Denmark 


Thorsten Jansen 
DWE, Germany 
Tiago Teixeira 
UNP, Portugal


Vaia Gousdova 
SILO, Greece 


Vasilis Koukos 
SILO, Greece 


Vasilis Koukos 
UPRC, Greece 


Vicent Sebastia 
WEA, Spain 


Vicky Foteinou 
CP, Greece 


Victoria Michailidou 
RB, UK 


Vito Morreale 
ENG, Italy 


Vittorio Monferrino 


GFT, Italy 


Yasar Khan 
NUIG-Insight, Ireland 


Ziga Bucaj 
BOS, Slovenia 


Abstract 


The large number of emerging FinTech companies, and the transformation that financial corporations (i.e. banks, credit and insurance companies) are undergoing as a consequence of the new wave of disruptive human-centric services, are changing the landscape of the global financial economy and making the financial sector evolve very fast.

The use of emerging technologies such as BigData, Machine Learning frameworks and algorithms, and Artificial Intelligence within the financial sector is only one example of how technology can catapult a large number of new user-centric services.

In this second part of the series we review the basic concepts of FinTech, referring to the diversity in the use of technology to underpin the delivery of financial services.

The demand and supply sides of the financial sector are presented, and we further discuss why FinTech is the focus of industry nowadays and what the waves of digitization mean.

Financial technology (FinTech) and insurance technology (InsuranceTech) are 
rapidly transforming the financial and insurance services industry. We provide an 
overview of Reference Architecture (RA) for BigData, IoT and AI applications in 
the financial and insurance sectors (INFINITECH-RA). 

Moreover, this book reviews the concept of innovation and its application in INFINITECH, and presents practical examples of the innovative technologies provided by the project for the financial sector.


DOI: 10.1561/9781638282310.ch1 


Chapter 1 


FINTECH Services 


1.1 FINTECH Services


This book series aims to specify different aspects of each large-scale pilot: readiness, development and validation of different services and components. Validation is a core pillar, as one of the main objectives of INFINITECH is to test innovative technologies (IoT, BigData, AI, ML, Blockchain and more) with a view to improving business services in the financial and insurance sectors. Specifically, the present deliverable reports on the readiness of the various pilot sites to test the innovative INFINITECH AI, IoT and BigData technologies in the testbeds/sandboxes that are developed during the project, while validating their ability to improve the business processes of end-user organizations (i.e. financial organizations, banks, and FinTech firms). [D7.1]

In summary, this deliverable reports the following information for each of the pilot sites:


* A general overview of the status of the pilot, including its main business and technical objectives.

* The development status of the different components and services that comprise each pilot system.

* The status of the integration of a subset of their components as part of a Proof-of-Concept (PoC) pilot system.

* Information on the availability and deployment status of the testbed/sandbox, where the pilot's final infrastructure will be deployed and validated.


Pilots have already contributed to previous deliverables and tasks (requirements, user stories, security, policies, technologies, services, RA, etc.). This deliverable therefore builds on top of these contributions. It also integrates and extends them by illustrating how individual technical activities enable the integration and deployment of a complete pilot system with relevance for the end-users (i.e. financial organizations, banks, FinTechs). Overall, the present deliverable focuses not only on individual contributions but also on the overall readiness of the pilot sites and the pilots' frameworks as a whole. [D7.1]

The book presents the status of an initial PoC implementation for each pilot. This PoC enables a first demonstration of the viability and applicability of the various INFINITECH technologies that support the pilots. The pilots' PoCs demonstrate the developments accomplished to date and serve as a basis for ensuring that the pilots' developments are on the right track, while identifying points that need attention.

The current status of testbeds/sandboxes is also an important part of the deliv- 
erable, because it directly affects the readiness and demonstrability of each pilot. 
Therefore, a quick overview of each testbed/sandbox is covered by each pilot. 

The book aims to aid an understanding of the overall progress of the pilots, serving as an index for going deeper into the different pilots' achievements. Further, the book has started work on synergies and KPIs, so a preamble of this work is included. Finally, the book makes a first introduction to actions related to the main suggestions coming from the Review Report. [D7.2]

This book includes the outcomes of the project in relation to pilot systems and 
pilot activities in the INFINITECH project. The information included reflects the 
activities and the operations conducted for the pilots of the project related to smart 
and reliable scoring, risk and service assessment. [D7.3] 

The current book clusters the activities, organization, and deployment of three different pilots, namely Pilot #1, Pilot #2, and Pilot #15. These pilots refer to specific application fields: Scoring, Risk and Service Assessment. They all apply ML algorithms to business cases, aiming to enter the market with necessarily novel approaches. The pilots are data-driven and analyse different sources of data by pre-processing and converting such information into viable and effective data sources. [D7.3]

This book contains an overview of each of the pilots listed above, addressing several technical and operational aspects. Each pilot is briefly introduced, highlighting three key questions: What is the problem it is intended to solve? How is it solved? What are the main benefits? Subsequently, the pilots are described in terms of two main streams: the pilot systems and the pilot activities. [D7.3]

Figure 1.1. Cluster #1 Pilot applications.

The pilot system stream reports the business services: each pilot explains its innovation and lists the technologies and components used, mapping them to the services. The pilot activities stream describes the operations and roadmap towards the first cycle of development and beyond, along with descriptions and visualizations of the actual status of implementation, concluding with information about the performed or planned validation workshops. [D7.3]

We report on the activities of Cluster #1, which is devoted to developing, conducting and operating the pilots of the project related to smart and reliable scoring, risk and service assessment (Figure 1.2). The pilots feature similarities in terms of their characteristics, yet they will be deployed using different technologies and based on different sandboxes. The pilots will be deployed and validated in three iterations that will gradually advance the maturity of each of the pilot deployments. This approach is taken to ensure the proper technical and business validation of the pilot systems. [D7.3]

The three pilots within Cluster #1 (Pilots #1, #2 and #15) refer to three specific applications: Scoring, Risk and Service Assessment. All the pilots exploit ML algorithms, Big Data, and other technologies to address business cases, aiming to penetrate the market with necessarily novel approaches. Figure 1.1 maps each pilot to its specific application field and shows graphically the major components of each of them. [D7.3]

Pilot #1 (Invoices Processing Platform for a more Sustainable Banking Industry) deals with the extraction of information from invoices, running ML algorithms on such data to analyze them and compare all the different sources to come up with a Sustainability Index Scoring. The core part is the extraction, analysis and conversion of data in the form of text (including tables) and images. AI technologies are applied both to the scanning of physical documents and to the development of automated sustainability index scoring; as a consequence, this approach leads to cost savings and increased efficiency. [D7.3]

Figure 1.2. Cluster #2 Pilot applications.

Pilot #2 (Real-time risk assessment in Investment Banking) implements a real-time risk assessment and monitoring procedure based on two risk metrics, Value at Risk (VaR) and Expected Shortfall (ES), together with market sentiment analysis, to estimate market risks and to update them as market prices and/or portfolios change in (near) real time. Moreover, the estimation of changes in risk measures before a new trading position is entered will be implemented. Several stakeholders would benefit from such a pilot, mainly because it is risk-driven and both processes data and provides results in real time or near real time. [D7.3]
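
As an illustration of the two risk metrics named above (VaR and ES), the following minimal Python sketch estimates both by historical simulation. It is not the pilot's implementation: the choice of historical simulation, the function name historical_var_es, the confidence level and the sample returns are all assumptions made for this example only.

```python
import math
from typing import Sequence, Tuple


def historical_var_es(returns: Sequence[float],
                      confidence: float = 0.95) -> Tuple[float, float]:
    """Estimate VaR and ES by historical simulation (illustrative only).

    Both values are returned as positive loss fractions.
    """
    if not returns:
        raise ValueError("returns must not be empty")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")

    # Convert returns to losses and sort from the largest loss downwards.
    losses = sorted((-r for r in returns), reverse=True)

    # Number of observations in the tail beyond the confidence level.
    # The small epsilon guards against floating-point error in the product;
    # at least one observation is always kept.
    tail_size = max(1, math.floor(len(losses) * (1.0 - confidence) + 1e-9))
    tail = losses[:tail_size]

    var = tail[-1]              # smallest loss inside the tail: the VaR threshold
    es = sum(tail) / tail_size  # average loss inside the tail: Expected Shortfall
    return var, es


# Hypothetical daily portfolio returns, used only to exercise the function.
sample_returns = [0.01, -0.02, 0.005, -0.05, 0.03,
                  -0.01, 0.02, -0.03, 0.015, -0.005]
var_80, es_80 = historical_var_es(sample_returns, confidence=0.80)
```

In a (near) real-time setting such as the one described for the pilot, a function of this kind would be re-evaluated whenever new prices arrive or the portfolio changes. Calling it on the returns of a hypothetical portfolio that already includes a candidate position gives a simple pre-trade estimate of the change in risk.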

Pilot #15 (Open Inter-Banking Pilot) classifies the information contained in a subset of process-operating documents used by Italian banks, to build a business glossary respecting the ABI Lab taxonomy, thus supporting Enterprise Architecture Modelling. This pilot will allow the screening of extensive documentation in real time, addressing a business pain shared by several banks, and is therefore pre-competitive and strongly market-driven. [D7.3]

This book also aims to specify different aspects of the large-scale pilots that involve personalized recommendations to customers and customer-centric analytics: the development, deployment, extension and validation of the different services and components that will be developed or used as part of cluster #2 of the INFINITECH project. [D7.6]
The pilots feature similarities in terms of their characteristics, yet they will be deployed in different sandboxes and based on different technologies. All pilots will be deployed in three iterations that will gradually advance the maturity of each of the pilot deployments, while at the same time ensuring the proper technical and business validation of the pilot systems. [D7.6]

The related pilots listed below are intended to provide personalized financial products and services for both investment and retail banking. They include various personalized services based on customer-centric analytics and personalized digital assistants:

Pilot #3: Customer Centric Analytics & KYC: Use the customer-analytics solution in additional banking/finance processes commonly named Know Your Customer/Know Your Business (KYC/KYB); move the solution to pre-production deployment for selected services (Partners: BPFI, NUIG, BOI, IBM).

Pilot #4: Personalized Asset Management: Integrate more data into PRIVE's solution and expand the recommendations to additional products; include the solution in PRIVE's consulting services (Partners: PRIVE, RB).

Pilot #5b: Smart and Personalized Pocket Assistant for PFM & BFM tools delivering a Smart Business Advise: Run a larger-scale pilot with the engagement of more customers of the bank; offer the respective functionalities also to corporate customers and, moreover, through integration with other third parties; receive feedback, improve usability and deploy in production; deploy a similar solution to other banks and financial institutions (Partners: BOC, GFT, UPRC, CP).

Pilot #6: Personalized Investment Recommendations for Retail Clients: Following business validation, prepare the solution for use in NBG's portfolio of retail solutions; exploit the tailored sandbox of the solution to drive additional innovation inside the bank (Partners: NBG, CP, RB, UBI, LXS, GLA).

In particular, we intend to provide an overview of each of the pilots above, answering the following questions:


* What is the problem, and how is the pilot development addressing it?

* What innovation does the pilot bring, either for business or for technology? What technologies, developed within or outside INFINITECH, are used for this pilot?

* What is each pilot's development roadmap?

* Which workshops are planned with internal or external stakeholders to validate the expected results and outcome of each pilot?


We also specify different aspects of INFINITECH cluster #3 (Figure 1.3), a cluster of four pilot systems that involve Predictive Financial Crime and Fraud Detection. The focus is on the development, deployment and validation of different services and components of the pilot systems. [D7.9]
Figure 1.3. Cluster #3 Pilot applications.


The four pilots of the cluster feature similarities as far as their characteristics are concerned, yet they are deployed in different sandboxes and based on different technologies. All pilots are developed and deployed in three iterations that gradually advance the maturity of each of the pilot deployments, while at the same time ensuring the proper technical and business validation of the pilot systems. [D7.9]

The related pilots intend to provide advanced financial products and services for 
banks, supervisory authorities, financial institutions, and governmental agencies, 
aiming to prevent and protect against financial crimes and fraudulent activities, as 
follows: [D7.9] 

Pilot #7 Avoiding Financial Crime: explore more accurate, comprehensive and near real-time representations of suspicious behavior in financial crime, fraud, and cyber-physical attacks whose final objective is to steal bank customers' identities and money. (Partners: CXB, FTS, FBK)

Pilot #8 Platform for Anti Money Laundering Supervision (PAMLS): development of a platform that will improve the effectiveness of the existing supervisory activities in the area of anti-money laundering and combating the financing of terrorism by processing large quantities of data owned by the Bank of Slovenia and other competent authorities. (Partners: BOS, JSI)

Pilot #9 Analyzing Blockchain Transaction Graphs for Fraudulent Activities: the
aim of the pilot is to detect fraudulent activities by monitoring blockchain
transactions. (Partners: AKTIF, BOUN)

Pilot #10 Real-time cybersecurity analytics on Financial Transactions BigData: 
improved detection of cases of suspected fraudulent transactions, to enable the 
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identification of security-related anomalies while they are occurring, by the analy- 
sis in real-time of the financial transactions of a home and mobile banking system. 
(Partners: PI, ENG) 

Pilot #16 Data Analytics Platform to detect payments anomalies linked to money
laundering events: development of a data analytics platform, based on Machine
Learning and graph database composition, to help the NEXI AML team to discover,
monitor and analyze suspicious scenarios related to money laundering through
digital card payments. (Partners: NEXI, GFT)

We describe the pilot prototype, the technologies that underpin each pilot
system, as well as the innovative characteristics of each pilot system. We also
elaborate on each pilot's development roadmap. Specifically: [D7.9]


* Pilot#7 has already identified a valid dataset that includes all the
transactions of the type "Immediate loans" from October 2020 to March 2021,
and in which fraudulent transactions have been tagged. The dataset has also
been treated so as to anonymize the fields that could otherwise reveal
personally identifiable or confidential information. It has been shared
with the technical providers of the core technologies that support the pilot,
FTS and FBK, with a successful outcome: applying AI to the data and process
improved the efficiency of fraud discovery in this type of transaction.

* Pilot#8's Risk Assessment tool is now in its final stage of development and is
already in the test and verification phase on the Pilot#8 testbed. The Screening
tool is in its main development phase. Another important component of the PAMLS
platform is the pseudo-anonymization component, which enables regulation-
compliant data pseudo-anonymization in such a way that the analytical results
still represent value-added information. Pilot#8 also contributes three
components: Stream Story, Pattern discovery and matching, and Anomaly detection
and prediction. The Anomaly detection and prediction component is already
developed, the Pattern discovery and matching component is currently in its
main development phase, and the Stream Story component is at an early stage
of development.

* Pilot#9 is currently using its HPC-based scalable system to analyze massive
real Bitcoin and Ethereum cryptocurrency and ERC20 token transactions, in
order to trace fraudulent activities. It is also able to take token transactions
as input from permissioned ledgers such as Hyperledger Fabric. A market
analysis has been performed for Pilot#9, attempting to assess the size and
nature of the industry as well as competition and regulations. It was also found
that the Pilot#9 graph analysis system has potential uses in the area
of Central Bank Digital Currencies (CBDCs). In Period 2, implementation
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of machine learning based analysis of blockchain transactions started. The 
objective is to predict fraudulent blockchain addresses and hence the problem 
was posed as a supervised ML problem. 

* Pilot#10 arises from the need to overcome the limitations of rule-based sys- 
tems to block potentially fraudulent transactions and the need to exploit the 
ML capabilities to more effectively identify new kinds of risky transactions. In 
the period M18-M27 of the project, the fraud detection system architecture
was redesigned to more effectively support continuous batch ML model
retraining, according to MLOps best practices, which aim to deploy and
maintain machine learning models in production reliably and efficiently. Two
machine learning models are now adopted, one trained in an unsupervised 
fashion and one in a supervised one. 

* Pilot#16 entered the INFINITECH project in September 2021, so the
pilot status is not described here in the same level of detail as for
the other pilots. The book features information about the use case/
data-based reference scenarios and business services, as well as the technology
components foreseen for the pilot development and a roadmap.
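
The supervised formulation mentioned for Pilot#9 can be illustrated with a minimal sketch. It is not the pilot's implementation: the toy transactions, the labels, the three per-address features and the nearest-centroid classifier are all hypothetical, chosen only to show how features derived from a transaction graph can feed a supervised model.

```python
from collections import defaultdict
from math import dist

# Hypothetical toy transactions: (sender, receiver, amount).
TRANSACTIONS = [
    ("a1", "a2", 5.0), ("a2", "a3", 4.5), ("a1", "a3", 1.0),
    ("m1", "x1", 0.5), ("m1", "x2", 0.5), ("m1", "x3", 0.5), ("m1", "x4", 0.5),
    ("m2", "x1", 0.4), ("m2", "x2", 0.4), ("m2", "x5", 0.4), ("m2", "x6", 0.4),
    ("u1", "y1", 0.3), ("u1", "y2", 0.3), ("u1", "y3", 0.3), ("u1", "y4", 0.3),
]
# Hypothetical labels: 0 = legitimate address, 1 = fraudulent address.
LABELS = {"a1": 0, "a2": 0, "a3": 0, "m1": 1, "m2": 1}


def address_features(transactions):
    """Return {address: (out_degree, in_degree, mean_sent_amount)}."""
    out_deg, in_deg, sent = defaultdict(int), defaultdict(int), defaultdict(float)
    for sender, receiver, amount in transactions:
        out_deg[sender] += 1
        in_deg[receiver] += 1
        sent[sender] += amount
    addresses = set(out_deg) | set(in_deg)
    return {
        a: (out_deg[a], in_deg[a], sent[a] / out_deg[a] if out_deg[a] else 0.0)
        for a in addresses
    }


def train_centroids(features, labels):
    """Average the feature vectors of the labelled addresses, per class."""
    sums, counts = {}, defaultdict(int)
    for address, label in labels.items():
        vec = features[address]
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] += 1
    return {label: tuple(s / counts[label] for s in acc) for label, acc in sums.items()}


def predict(centroids, vec):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))


features = address_features(TRANSACTIONS)
centroids = train_centroids(features, LABELS)
prediction_for_u1 = predict(centroids, features["u1"])  # unlabelled address
```

In practice the feature set, the labelled data and the model family would be far richer; the point is only that each address becomes a feature vector with a known label, which is what makes the problem supervised.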


The book is devoted to developing, deploying, extending, and validating the 
pilots of the project that involve Predictive Financial Crime and Fraud Detec- 
tion. The pilots feature similarities in terms of their characteristics, yet they will 
be deployed in different sandboxes and based on different technologies. All pilots 
are deployed in three iterations. The clustering of the pilots is aimed at facilitating 
synergies between them, including knowledge exchange and best practice sharing. 
Pilot#16 started later, joining the project in September 2021. [D7.10]

We also describe category/cluster 4, which involves personalized usage-based
insurance products, and list all the activities (development, deployment,
extension and validation) related to the components, architecture, services and
dissemination activities carried out by the cluster 4 pilots. [D7.10]

Cluster 4 (Figure 1.4) comprises two pilots oriented to the insurance sector that
exploit IoT-based infrastructures to gather real-world, real-time data and develop
AI-powered services to enhance risk profiling. These pilots are building their own
infrastructure (according to the INFINITECH Reference Architecture) by combining
INFINITECH and pilot-specific technologies, configuring their corresponding
sandboxes and running their testbeds, all within the INFINITECH framework. They
are deployed in three iterations, which will gradually advance the maturity of each
pilot deployment and the validation of the business innovations it brings.
[D7.10]

Category 4 pilots are intended to provide personalized insurance products and
services based on IoT connected devices, including various personalized services
based on customer centric analytics and personalized digital assistants. These are:
[D7.12]

Figure 1.4. Cluster #4 Pilot applications.

Pilot #11: Personalized insurance products based on IoT connected vehicles:
Improve the risk insurance profiles using the information collected by connected
vehicles and applying IoT, HPC, Cloud Computing and Artificial Intelligence
technologies: drivers' classification and fraud detection. (Partners: ATOS, CTAG,
GRAD, DYN) [D7.12]

Pilot #12: Real World Data for Novel Health Insurance: Improve the risk
insurance profiles using the information collected by activity trackers &
questionnaires and applying IoT & ML technologies. (Partners: SiLO, iSPRINT,
RRDD, GRAD, ATOS, DYN) [D7.12]

The book contains an overview and status report for each pilot listed above,
introducing the identified problems (as the seeds of the business models), the
way they are addressed, the innovations involved (both technical and business)
and the corresponding roadmap. The scope of the present deliverable also
includes the main achievements of each of the usage-based insurance pilots,
covering: [D7.12]


* The design of the first prototype of each pilot system, in-line with the 
INFINITECH Reference Architecture. 

* The development of initial prototypes for both pilots, centered on data
gathering and integration, and leveraging INFINITECH technologies
(implemented in WP3/WP5), as well as pilot-specific technologies.

* The organization of a workshop for the validation and reception of stake- 
holder feedback about the usage-based insurance pilot systems. 
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The book provides the development, integration, and validation of the 
INFINITECH IoT-based pilots, oriented to the insurance sector and the person- 
alized usage-based services and products design. The work addressed by this task 
involves the end users’ engagement, deployment and integration of the required IoT 
systems, collection, storage and classification of data, identification and evaluation 
of proper ML/DL algorithms, applications/services development and deployment, 
stakeholders’ identification, technical and business innovations dissemination and 
a final business validation of the pilots involved in the INFINITECH Cluster/ 
Category 4: Personalized Usage-Based Insurance Pilots. 

This book series focuses on the progress and achievements related to the pilots'
user stories, technologies and architecture, which were already referenced in
the initial version. With this in mind, this document covers three main parts for
each pilot: [D7.13]


* Technological updates and new features implementation. 


o Both pilots (Pilot #11 in Section 2.2.1 and Pilot #12) have developed
and evolved their AI models for driving profiling and risk assessment
and introduced XAI methodologies. Their testbeds (following the
INFINITECH way) have also been updated. TRL status is also shown.


* End-users engagement and workshops 


o Stakeholder activities, workshops, and early adopters' progress are
summarized in a table. Both pilots have worked together to present new
achievements to the stakeholders.


This book series describes the use cases, pilots, and technical achievements of the
personalised insurance scenarios (Cluster 4), covering pilots 11 (Motor Insurance)
and 12 (Health Insurance). It contributes the final versions of the systems and
applications developed within the Cluster/Category 4 pilots, showing the final PoCs
achieved by each pilot and sharing their components through the INFINITECH
Marketplace. It also presents the final progress in terms of technologies, services
and outcomes offered to end users and to the FinTech marketplace. [D7.14]

This document describes and presents in detail the following end-user services: 


* Pay How You Drive is the service developed by the insurance company utiliz- 
ing the drivers’ profiling and classification in order to adjust motor insurance 
premiums according to driving behaviour. 

* Fraud detection is the service for the insurance companies providing them 
with the real circumstances of an accident, so as to support detection of fraud- 
ulent acts against them. 

* Health risk assessment is the service provided to insurance professionals that
applies the learnt health outlook models to the data of insured individuals,
to facilitate the professionals' decisions on modifying their clients'
health insurance premiums.

* Health fraud prevention is the service provided to insured individuals that
analyses the decisions of the above model to offer them actionable advice,
with the aim of persuading them to use the provided measurement system
truthfully.
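
The premium adjustment behind a service such as Pay How You Drive can be pictured with a small sketch. The class names, multipliers and function names below are hypothetical and serve only to illustrate how a driver classification could be mapped to a premium; real factors would be defined by the insurer's actuaries.

```python
from dataclasses import dataclass

# Hypothetical multipliers per driver class (illustrative values only).
CLASS_MULTIPLIER = {"cautious": 0.85, "average": 1.00, "aggressive": 1.25}


@dataclass
class Policy:
    holder: str
    base_premium: float  # annual premium before any behaviour-based adjustment


def adjusted_premium(policy: Policy, driver_class: str) -> float:
    """Scale the base premium by the multiplier of the driver's class."""
    try:
        factor = CLASS_MULTIPLIER[driver_class]
    except KeyError:
        raise ValueError(f"unknown driver class: {driver_class!r}") from None
    return round(policy.base_premium * factor, 2)


policy = Policy(holder="example-driver", base_premium=400.0)
premium_cautious = adjusted_premium(policy, "cautious")
premium_aggressive = adjusted_premium(policy, "aggressive")
```

Keeping the classification step separate from the pricing table means the profiling model can evolve without touching the pricing rules, and vice versa.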


In order to validate the cluster results, the insurance services developed within the
cluster were presented to relevant stakeholders in the sector, including internal
actuaries and external insurance companies, so that they could evaluate the listed
services and orient their further evolution and exploitation.

Pilot #11 Personalized insurance products based on IoT connected vehicles:
Improve the risk profiles in motor insurance using the information collected by
connected vehicles and applying Artificial Intelligence technologies: development
of drivers' classification and fraud detection services. (Partners: ATOS, CTAG,
GRAD, DYN)

Pilot #12 Real World Data for Novel Health Insurance: Improve the risk
insurance profiles using the information collected by activity trackers &
questionnaires and applying IoT & ML technologies. (Partners: SiLO, iSPRINT,
RRDD, GRAD, ATOS, DYN)

Specifically, this third report is focused on Usage-Based Insurance (UBI) services,
as Cluster #4, covering the Proofs of Concept of pilots #11 and #12. Pilot #11,
centred on motor insurance applications, and pilot #12, dealing with the health
sector, exploit AI technologies applied to the real-world data captured from the
insured clients (using IoT frameworks) to evaluate the real risk associated with
individuals and so create specific product lines that can be customised according
to defined profiles and/or a specific individual's behaviour. The target is to
evolve the way insurance companies calculate their clients' premiums, replacing
classical statistical techniques with real-time data from users. [D7.14]

This book series also focuses on category/cluster 5, which involves configurable
and personalized insurance products based on "Alternative and automated insurance
risk selection and insurance product recommendation for SMEs" and "Big Data
and IoT for the Agricultural Insurance Industry", and lists all the activities
(development, deployment, extension and validation) related to the components,
architecture, services and dissemination activities carried out by the cluster 5
pilots. [D7.15]

Cluster 5 is composed of two pilots that base their analysis on big data from
different sources, both open online sources and satellite imagery, to gather
real-world, real-time data and develop AI-powered services to enhance risk
profiling (Figure 1.5). These pilots will develop their own architecture by
combining INFINITECH and pilot-specific technologies, configure their
corresponding sandboxes and run their testbeds, all within the INFINITECH
framework. They will be deployed in three iterations, which will gradually advance
the maturity of each pilot deployment and the validation of the business
innovations it brings. [D7.15]

Figure 1.5. Cluster #5 Pilot applications.

Category 5 pilots are intended to provide configurable and personalized insur- 
ance products based on alternative data sources and big data, including various 
personalized services based on customer centric analytics and personalized digital 
assistants. These are: [D7.15] 

Pilot #13: Alternative and automated insurance risk selection and insurance
product recommendation for SMEs: By obtaining data from open sources and
applying machine learning, the pilot will be able to monitor changes in risks,
radically improving the management of the risks that companies face in their
daily activity.

Pilot #14 Big Data and IoT for the Agricultural Insurance Industry: Provide
insurance companies with a robust and cost-effective toolbox of functions and
services, allowing them to alleviate the effect of weather uncertainty when
estimating the risk of AgI products, reduce the number of on-site visits for claim
verification, reduce operational and administrative costs for monitoring insured
indexes and handling contracts, and design more accurate and personalized
contracts.

The objective of this book series is to identify possible synergies between the
Pilots, that is, problems and/or situations that follow the same pattern in
several Pilots. Identifying these synergies enables the Pilots to collaborate with
each other, sharing their knowledge and visions to overcome the problems and/or
situations that arise. [D7.18]

To identify synergies, it was necessary to analyse the User Stories that were
established. The identification of these synergies is also very relevant to the work
of the entire WP7 and its respective deliverables, for example for the collaboration
between Pilots in solving problems and/or in the use of technologies. [D7.18] The
synergies listed in the book series have been divided into categories. These
categories were defined at the beginning of the project, and their order has been
maintained when grouping the Pilots by category and by synergy. [D7.18]

One of the main objectives of the INFINITECH project is to introduce, validate
and evaluate advanced BigData and AI-based Digital Finance services in real-life
pilot settings. The basis of the framework that will be used to evaluate the fifteen
pilots that make up the whole project is described. As a matter of fact, this
deliverable represents the inception of a sequentially driven strategy, a.k.a.
Evaluation Framework, made up of two main phases: the first phase is meant to find
across-the-board Key Performance Indicators (KPIs) so as to obtain a standardized
way of evaluating the pilots; the second phase is focused on opening the way to a
full-fledged periodic evaluation of the pilots' progress (which will be subject to
future refinements so as to improve its efficacy in its profiling activities, as
well as to reduce the pilots' burden of providing periodic feedback). [D.20]

The status of the Pilots' KPIs is therefore illustrated, so as to provide an initial
snapshot on which to base the evaluation, together with the methodology that will
be used to carry out the future monitoring process (which belongs to the second
phase of the framework). [D.20]

Such outputs are based on continuous interaction with the pilots, from which
ABI Lab obtained an understanding of the aspects of the diverse use cases, such as
figuring out the needs regarding the KPIs, finding a proper terminology to
encompass all the pilots' use cases, defining the number of requested KPIs per
category, standardizing their format, identifying the pilots that have already
achieved their KPIs, etc.
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Applications and Services for the Financial Sector 


The Interoperable Data Exchange and Semantic Interoperability focuses on
establishing the foundation for common, shared meaning across the several data
sources and message and event feeds within the INFINITECH platform while
facilitating the technical implementation of the INFINITECH principles. In this
landscape, a set of objectives is defined: [D4.1]


1. Define shared semantics (ontologies) for semantic interoperability of
BigData and IoT streams in the finance/insurance sectors;

2. Provide the means for scalable, massive analytics over linked semantic
streams;

3. Provide a permissioned blockchain solution for exchanging data across
different organizations in the finance and insurance supply chains;

4. Enhance the permissioned blockchain of the project with tokenization
functionalities, as a means of enabling digital assets trading; and

5. Implement techniques for secure querying of encrypted personal data over a
blockchain.
Taking into account the overall objectives, the following set of activities is
identified as the main outcomes of the INFINITECH project objectives:


* Shared Semantic for BigData and IoT Streams: This activity specifies models
and ontologies for semantic interoperability of diverse applications in the
finance and insurance sectors. It extends and integrates ontologies such as
Financial Industry Business Ontology (FIBO)/Financial Instrument Global
Identifier (FIGI) with additional concepts associated with INFINITECH
applications and testbeds. The task produced the project's ontology for
semantic interoperability, which will provide the concepts needed for
annotating and linking diverse data streams.

* Massive Distributed Processing of Semantically Linked Streams: This task
provides a prototype implementation of the Super Stream Collider (SSC)
engine, called SeSA-ME (Semantic Stream Analytics Engine/Middleware),
that enables analytics for semantically linked streams (linked data). The
engine is scalable and suitable for massive parallelization in cloud
environments. It is implemented on top of the SSC component, which is customized
to support linked data in line with the shared semantics specified in the
INFINITECH priorities.

* Distributed Ledger Technologies for Decentralized Data Sharing: This task
implements permissioned blockchain infrastructures based on Corda R3
and/or the open source Hyperledger Fabric project. These blockchains are
customized to support the requirements of the financial sector, including data
models, authentication and authorization mechanisms, as well as APIs for
implementing Ledger Clients for financial/insurance sector applications. The
infrastructure is integrated with existing BigData/IoT platforms in the testbeds,
based on appropriate ledger clients.

* Tokenization and Smart Contracts Finance and Insurance Services: This
activity enhances the permissioned blockchain with cryptographic tokenization
features, as a means of enabling assets trading. Likewise, the activity
specifies and implements Smart Contracts for adding and retrieving information
on the tokenized blockchain for all the essential data exchange use cases of the
project's pilots. The applications provide the means for trading access to data
and information through the permissioned blockchain. The activity specifies
and implements ledger protocols for the financial/insurance applications,
including trading protocols.

* Secure and Encrypted Queries over Blockchain Data: This activity implements
and provides a framework for querying encrypted data over the project's
permissioned blockchain infrastructure. It exploits and customizes algorithms
from the OPAL project, based on Multi-Party Computation (MPC) and
Linear Secret Sharing (LSS) schemes (i.e. homomorphic encryption). The
mechanisms implemented resemble Enigma's (enigma.io) Personal Data
Management infrastructure, through the integration of consent mechanisms
that enable consumers/customers to provide consent for access to their data
through the blockchain. In conjunction with the trading and tokenization
functionalities of the blockchain, this activity created the foundation for a
personal data market where customers are able to trade their data in exchange
for tokens or other assets.

* Situation Awareness Front-End over Aggregated Information: This activity
provides a web-based framework for the visualization of the aggregated results
of analytic algorithms developed in the scope of the INFINITECH project,
and more generally of all information of relevance. The framework is based
on the community edition of Knowage, an OS solution for BI, which is part
of the OW2 community. The Knowage suite is extended and customized in
order to support specific data models and persistence technologies. The
visualization functionality allows users to assemble personalized dashboards for
situation awareness, wiring together related information from different sources.
Special emphasis was placed on visualizing information from distributed ledgers.
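
The secret-sharing idea behind the encrypted-query activity can be illustrated with additive secret sharing, one simple linear scheme. This is a sketch under stated assumptions and not the OPAL or INFINITECH implementation: the modulus, the number of parties, the sample values and all function names are hypothetical.

```python
import secrets

# Public modulus (the Mersenne prime 2**61 - 1), assumed for this sketch.
PRIME = 2**61 - 1


def share(secret: int, n_parties: int) -> list[int]:
    """Split `secret` into n additive shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares


def reconstruct(shares: list[int]) -> int:
    """Recombine shares; any proper subset alone reveals nothing about the secret."""
    return sum(shares) % PRIME


def secure_sum(all_shares: list[list[int]]) -> int:
    """Each party adds the shares it holds locally; only the total is opened."""
    per_party_totals = [sum(column) % PRIME for column in zip(*all_shares)]
    return reconstruct(per_party_totals)


# Hypothetical private values held by three customers.
balances = [1200, 350, 980]
shared = [share(balance, n_parties=3) for balance in balances]
total = secure_sum(shared)
```

Because the scheme is linear, sums can be computed directly on shares, which is why an aggregate query can be answered without any party seeing an individual value.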


Background and Related Works 


This section is intended to frame the research realized under the scope of the
Shared Semantic for BigData and IoT Streams [D4.1 to D4.3]. It establishes a
common ground and a necessary foundation to support the design and definition
of the proposed methodology for developing INFINITECH models and ontologies
for semantic interoperability while avoiding any misunderstanding regarding
INFINITECH main concepts.


Figure 2.1. Different Interoperability levels according to [6] (levels shown:
Technical, Syntactical, Semantic and Organisational Interoperability).


Concepts and Definitions 


Interoperability 


Figure 2.2. Interoperability in INFINITECH perspectives and Task 4.1 main focus [8].


There is no unique definition of interoperability in the literature, since the
concept has different meanings depending on the context. As a matter of fact,
according to ISO/IEC 2382-01 [1] interoperability is: "The capability to
communicate, execute programs, or transfer data among various functional units in a
manner that requires the user to have little or no knowledge of the unique
characteristics of those units". According to Next Generation Networks (NGN) from
ETSI technical committee TISPAN [2], interoperability is: "the ability of equipment
from different manufacturers (or different systems) to communicate together on the
same infrastructure (same system), or on another". EICTA defines interoperability
as [3]: "the ability of two or more networks, systems, devices, applications or
components to exchange information between them and to use the information so
exchanged". Whatever the particular definition, interoperability is always about
making sure that systems are capable of sharing data with each other and of
understanding the exchanged data [4]. In this scenario the word "understand" covers
the content, the format and the semantics of the exchanged data [5].
Interoperability ranges over four different levels [6], namely:


I. Physical/technical interoperability: concerns the physical connection
of hardware and software platforms;
II. Syntactical interoperability: concerns data format, i.e. it relates to how
the data are structured;
III. Semantic interoperability: concerns the meaningful interaction
between systems, devices, components and/or applications; and
IV. Organizational interoperability: concerns the way organizations share
data and information.


Interoperability in INFINITECH 


The first three interoperability levels are part of the INFINITECH platform and
are handled in Task 4.1. INFINITECH Semantic models and Ontologies are the final
result of an exercise that takes as inputs the physical and syntactical
interoperability aspects already analysed in WP2: Task 2.1 (User Stories and
Analysis of Stakeholders' Requirements), Task 2.5 (Open Banking APIs, Testbeds and
Data Asset Specifications), Task 2.6 (Specification and Design of Integrated Data
Models) and Task 2.7 (Reference Architecture for BigData, AI and IoT in Financial
Services Industry).




As stated in [7], nowadays ICT solutions, in the most disparate contexts of
application (e.g. manufacturing, healthcare, automotive, white goods, logistics,
finance, etc.), comprise several distinct elements (e.g. devices, communication
infrastructures, services, applications, etc.) that are typically distributed
and heterogeneous and that need to cooperate and communicate with each other.
However, communication between two systems involves more than the particular
network protocol to be used. Several aspects need to be considered whenever a
communication channel between two systems needs to be established. As a matter of
fact, the information flow within an ICT system and/or platform ranges from
information detection through data extraction, data transformation, data
provisioning and data processing to data usage. In such a context, interoperability
represents the enabler and the facilitator for this flow. As shown in Figure 2.2,
interoperability can be seen from different perspectives; however, Task 4.1 is
restricted to discussing semantic interoperability and thus data models,
information models and ontologies.


Semantic Interoperability 


Semantics plays a main role in interoperability by ensuring that the information
exchanged between counterparts carries meaning. For computer systems, this
notion of Semantic Interoperability translates into the ability of two or more
systems to exchange data with a precise, unambiguous and shared meaning,
thereby allowing the data to be readily accessed and reused.

Since around the 1990s, the concept of the Semantic Web [9], coined by World Wide
Web (WWW) inventor Tim Berners-Lee, has been the subject of extensive research
and industrial application, becoming the foundation for Semantic Web Services
and, more recently, for Semantic Internet of Things (IoT) concepts [10-12]. All of
them aim to enable collaboration across semantically heterogeneous environments,
contributing to a connected world of consuming and provisioning devices that can
exchange and combine data to offer new or augmented services. However,
accomplishing this vision has raised several challenges due to the varied standards,
legacy system constraints, tools, etc. currently in use worldwide.

The semantic interoperability process can, therefore, focus on different
viewpoints of semantic aspects, such as the description of the exchanged data or
the terms of the interaction between systems. For example, besides defining the
meaning of a given sensor value, the interoperability specification can also
provide information on the units of that value or on the protocols to use in order
to connect to the provider device and extract the value.
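
A short sketch may make the sensor example concrete. The description below uses a JSON-LD-style layout; the vocabulary IRIs under example.org, the field names, the conversion table and the sensor itself are assumptions made for illustration and are not taken from the INFINITECH models.

```python
import json

# Hypothetical vocabulary IRIs, used only for illustration.
CONTEXT = {
    "quantity": "http://example.org/vocab#quantityKind",
    "unit": "http://example.org/vocab#unit",
    "value": "http://example.org/vocab#numericValue",
    "protocol": "http://example.org/vocab#accessProtocol",
}

# A reading that states what is measured, in which unit, and how to fetch it.
reading = {
    "@context": CONTEXT,
    "@id": "http://example.org/sensors/vehicle-42/speed",
    "quantity": "Speed",
    "unit": "km/h",
    "value": 72.0,
    "protocol": "MQTT",
}

# Conversion factors to metres per second for the units this sketch knows.
TO_METRES_PER_SECOND = {"km/h": 1 / 3.6, "m/s": 1.0, "mph": 0.44704}


def speed_in_si(description: dict) -> float:
    """Convert a described speed reading to m/s using its declared unit."""
    if description["quantity"] != "Speed":
        raise ValueError("not a speed reading")
    return description["value"] * TO_METRES_PER_SECOND[description["unit"]]


serialized = json.dumps(reading, indent=2)
speed_si = round(speed_in_si(reading), 2)
```

Because the unit travels with the value, a consumer can normalise readings from different providers without out-of-band agreements.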
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Semantic Models 


Semantic information can be modelled in several ways, including key-value,
mark-up scheme, graphical, object-role, logic-based and ontology-based models
[13]. Of these, the key-value type offers the simplest data structure but lacks
expressivity and inference. The ontology-based model, on the other hand, provides
the best way to express complex concepts and inter-relations, and is therefore
the model most commonly used for elaborating semantic models.
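
The difference in expressivity can be sketched in a few lines. The class names below are hypothetical (they are not actual FIBO terms), and the triple store is just a Python set; the point is that a class hierarchy allows facts to be inferred that a flat key-value record cannot express.

```python
# Key-value model: a simple lookup with no relations between concepts.
key_value = {"instrument_type": "CorporateBond"}

# Ontology-style model: (subject, predicate, object) triples.
TRIPLES = {
    ("CorporateBond", "subClassOf", "Bond"),
    ("Bond", "subClassOf", "DebtInstrument"),
    ("DebtInstrument", "subClassOf", "FinancialInstrument"),
    ("bond-001", "type", "CorporateBond"),
}


def superclasses(cls: str, triples: set) -> set:
    """Return all direct and inferred (transitive) superclasses of `cls`."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == "subClassOf" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def is_instance_of(individual: str, cls: str, triples: set) -> bool:
    """True if `individual` has type `cls` directly or through the hierarchy."""
    direct = {o for s, p, o in triples if s == individual and p == "type"}
    return any(d == cls or cls in superclasses(d, triples) for d in direct)


inferred = is_instance_of("bond-001", "FinancialInstrument", TRIPLES)
```

The key-value record can only answer "what is the instrument type?", whereas the triple model can also answer "is this a financial instrument?" without that fact being stored explicitly.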


Ontologies 


Since the Semantic Web started to take shape, its inherent semantic
interoperability has been mostly grounded on the use of ontologies as the basis for
knowledge representation. Usually there is a top-level ontology (or domain
ontology) and multiple sub-domain ontologies, each one representing a more
specific domain. Through the use of ontologies, an entity is given a meaning that
can be understood [14].


Semantic Annotations 


Semantic annotation is the process of attaching additional information to any
element of data contained in some sort of document. Ontologies on their own
are not sufficient to fulfil the semantic interoperability requirements that enable
data to be read by machines, as there may be differences and inconsistencies.
Semantic annotation has been widely used to fill this gap by creating links
between the disparate ontologies and the original sources [15].
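
As a minimal sketch of this linking step, the mapping below annotates the fields of two differently structured sources with shared ontology terms. The source names, field names and the `ex:` terms are hypothetical and are not actual FIBO identifiers.

```python
# Hypothetical annotations: source field name -> shared ontology term.
ANNOTATIONS = {
    "bank_a": {"acct_no": "ex:AccountIdentifier", "amt": "ex:MonetaryAmount"},
    "bank_b": {"iban": "ex:AccountIdentifier", "value_eur": "ex:MonetaryAmount"},
}


def annotate(source: str, record: dict) -> dict:
    """Re-key a source record by the ontology term attached to each field.

    Fields without an annotation are dropped, since their meaning is unknown.
    """
    mapping = ANNOTATIONS[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}


record_a = annotate("bank_a", {"acct_no": "A-1", "amt": 10.0, "internal_flag": True})
record_b = annotate("bank_b", {"iban": "B-7", "value_eur": 25.5})
```

After annotation both records expose the same keys, so downstream analytics can treat them uniformly even though the original field names differ.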
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3.1 Configurable and Personalized Insurance Products 


Pilot #13: Configurable and Personalized Insurance Products
for SMEs


The pilot's objective is to profile the insurance needs of small and medium
enterprises, in order to know their risks better and thus be able to offer a
personalized selection of products and coverage. The pilot will automate the
subscription process, which helps the insurance company reduce costs. In addition,
verifying that the data entered is correct through a double verification avoids
possible errors in the cost of the insurance premium (Figure 3.1). Monitoring and
identifying risk changes in real time allows the company to know whether the
insurance cost really corresponds to the real risk of the SME or whether it should
be increased or decreased to adapt it to the SME's current situation. This is based
on collecting information about these companies from open and alternative data
sources, rather than those traditionally used by insurers. To achieve this, the
platform's robots filter and track the company's fingerprint on the internet in
various open sources, ranging from social networks and public official records to
review websites, the company's own web pages, etc.
[Figure 3.1 panels: 1. Input; 2. Technology (open data matching, big data);
3. Results; 4. Output (download data / online platform); 5. Take action
(cross-selling, risk prevention, activate clients, tailored offer).]

Figure 3.1. Pilot #13 summary.


Pilot #13 will monitor risk changes, so it will be able to substantially improve
the management of the risks that companies (SMEs) face in their daily activity.
The indicators will be based on information about each company that comes from
online sources and describes its digital presence and activity, such as business
volume, participation in social networks, number of employees, use of e-commerce
and payment platforms (Figure 3.2). The company to be analysed does not need to
provide much information; the developed tools are in charge of searching for and
gathering information related to the company from many sources. In this way, a
risk profile will be generated for each company analysed, which allows the product
offering to be customized and risk management to be permanent and automated. This
is not the only use of the data: insurance companies will also use this
information to design better customized products.


Technological components and Services 


This book series links the shown software components with the corresponding RA 
layers, providing some details about their implementation (Figure 3.5). 


* Data Sources (Infrastructure). To obtain the data from the information
sources we will use the automatons developed by Wenalyze, based on extensions
and instances.

* Data Management (Data Collection and Aggregation). For data management
we will rely on LeanXcale with its non-relational databases, its data manager
and its polyglot engine.

* Analytics. For the application of the ML models developed we will use the
Wenalyze platform, which will connect with the testbed in NOVA through
JavaScript.

* Connectivity. Finally, connectivity is foreseen through a REST API but, to
facilitate the realization of the PoC, connectivity has also been developed
through a browser, with access by user name and password, or through the upload
of files in CSV format.
* Hardware: architecture x86-64; memory 32 GB of RAM; storage 500 GB; storage
type SSD; network performance 25 Gigabit.
* Software: Linux operating system (Debian like Ubuntu, no GNU); SSH access;
Node.js and NPM; GitHub; Nginx server.


Figure 3.3. Pilot #13 hardware/software requirements.


Testbed 


Pilot #13 will implement an automation of the subscription process that helps
the insurance company reduce costs. In addition, verifying the entered data with
a double verification avoids possible errors in the cost of the insurance
premium. The monitoring and identification of real-time risk changes allows the
company to know whether the insurance cost really corresponds to the real risk of
the SME or whether it should be increased or decreased to adapt it to the SME's
current situation. The infrastructure that will be used will be hosted at
UNINOVA.


* Only the database system by LeanXcale will be deployed in the NOVA hosting.
* The solution consists of executing transactions and storing and managing them
in non-relational databases.


Other non-technical requirements 


A wealth and variety of important data is essential for the proper functioning of
Pilot #13. The data are obtained from open sources that provide different
information related to the real-time activity of the companies. We will use two
sources of data: one that is available internationally and a second that must be
incorporated in each of the countries. These secondary, per-country data sources
do not always offer the same information and this, although avoidable, can
complicate the development of the pilot.


Implementation of a first Proof of Concept 


This service makes it possible to monitor the risks of SMEs now and in the
future. It therefore improves the control of the accident rate and the renewal of
insurance policies, and makes it possible to offer personalised insurance cover
(Figure 3.4).

The RA layers that will be used are data processing and data analytics. The
information used as input in the proof of concept will be SME website data,
opinion platforms, business directories, social media and e-commerce platforms
(Figure 3.5). The PoC will be run on data retrieved from Spanish companies.
Figure 3.5. Data analyzing routine for Pilot #13.


Expected Outcomes 


* Better knowledge of the behaviour of SMEs in relation to the risks they face
in their activity.

* Reduction of the information that must be entered manually to quote policies.

* Increased automation in determining the level of risk and the coverage and
services adapted to the needs of each SME, based on its activity and risk.

* Design of insurance products adapted to the needs of SMEs.


Datasets 


Data will be extracted from open sources such as company websites, official
registers, social networks and opinion forums. The data will include 150,000 SME
targets with 50,000,000 data fields.


* SMEWIF: SMEs website information and functionalities. Description of the
text contained in the companies' websites, their services and the structure of
the company.

* ROPS: Review and opinion platforms. Reputation information and opinions of
clients about products and services.

* EUBD: European SMEs Business Directories. Official and legal information
about the companies: corporate purpose, activities, and other companies in which
they hold equity.

* GIO: SMEs geolocation information and characteristics, images and
geographical information.

* SMSIP: Social media SMEs information and presence.

* I&R: Key performance indicators and insurance needs.


The pilot will also use synthetic data: "P13-Alternative/automated insurance risk
selection - product recommendation for SME", a set of synthetic raw SME data.


Data Produced 


The output will be the ERP (Enterprise Risk Profile) and EIAU (Enterprise Insur- 
ance Automated Underwriting). 

With these two outputs, the pilot obtains not only the risk profile and risk
levels of any SME, but also the information needed to automate the subscription
and to apply rules that compute the price automatically.
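
As a rough illustration of how rules might turn a risk profile into a price, the sketch below maps a risk level to a premium multiplier. It is not the pilot's pricing logic: the names `EnterpriseRiskProfile`, `RISK_MULTIPLIERS` and `quote_premium`, the three risk levels and all numeric values are assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical rule table: risk level -> premium multiplier (illustrative values only).
RISK_MULTIPLIERS = {"low": 0.9, "medium": 1.0, "high": 1.3}


@dataclass(frozen=True)
class EnterpriseRiskProfile:
    """Minimal stand-in for an ERP record; the fields are assumptions."""
    sme_id: str
    risk_level: str  # expected to be one of the keys of RISK_MULTIPLIERS


def quote_premium(profile: EnterpriseRiskProfile, base_premium: float) -> float:
    """Apply the rule table to a base premium and return the quoted price."""
    if base_premium < 0:
        raise ValueError("base_premium must be non-negative")
    try:
        multiplier = RISK_MULTIPLIERS[profile.risk_level]
    except KeyError:
        raise ValueError(f"unknown risk level: {profile.risk_level!r}") from None
    # Round to cents so the quote can be shown to the user directly.
    return round(base_premium * multiplier, 2)
```

Keeping the rules in a plain table, separate from the code that applies them, is one way to let underwriters adjust prices without touching the automation itself.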


Explainable Workflow 


Pilot #13 is a "Big Data" analysis platform that applies ML (Machine Learning)
and AI (Artificial Intelligence) technologies to better predict the insurance
needs of SMEs.

The system must be prepared for commercial use by companies, so it needs a user
interface through which they can manage the information (Frontend) and a
management layer at the logical level (Backend).

The companies (enterprises) will access the platform through a registration
process and subsequent validation, in which they are assigned a package with a
number of customers. The basic and commercial information will be recorded in
Amazon Cognito, and the logical information of the company will be recorded in a
DynamoDB table called Enterprises.

To use the platform, the user must load the information stored in their own
systems; this is called raw data (crude-data). The raw data will be uploaded to
the platform as structured information in CSV format. The companies that use the
service can load a limited number of clients in crude-data; this is monitored
through the fields limit, clients_uploaded and total_clients_uploaded of the
Enterprises table.
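
The quota check on a CSV upload could look roughly like the following sketch. The field names `limit`, `clients_uploaded` and `total_clients_uploaded` come from the text; their exact semantics, the in-memory `dict` standing in for a row of the Enterprises table, and the function names are assumptions for illustration.

```python
import csv
import io


def count_clients(csv_text: str) -> int:
    """Count data rows in a CSV upload (one client per row, header excluded)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(1 for _ in reader)


def register_upload(enterprise: dict, csv_text: str) -> dict:
    """Return an updated copy of the enterprise record, or raise if over quota.

    Assumed semantics: `limit` is the maximum number of clients that may be
    loaded, `clients_uploaded` counts the clients currently loaded, and
    `total_clients_uploaded` is a running total over all uploads.
    """
    new_clients = count_clients(csv_text)
    if enterprise["clients_uploaded"] + new_clients > enterprise["limit"]:
        raise ValueError("upload would exceed the contracted client limit")
    updated = dict(enterprise)  # leave the caller's record untouched
    updated["clients_uploaded"] += new_clients
    updated["total_clients_uploaded"] += new_clients
    return updated
```

In a real deployment the read and the update would need to happen atomically in the database, so that two concurrent uploads cannot both pass the check.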

Each row of this document identifies a client, which can be targeted in different
sources of information on the Internet and in other open sources in real time,
depending on the information available (the quality of the information depends on
the company). Each target is recorded in the DynamoDB Targets table
(Infrastructure).

Once the robots have obtained the least recently updated target for a particular
source, they retrieve the information and store it in big data. This information
container will be an Amazon S3 bucket called big-data, and the information will
be stored in a folder named after the company's identifier. The information
obtained from the source is stored in that folder in a JSON document whose name
is the user's identifier (Data Management). From this layer, the analysis
algorithms are applied and the results for the analysed companies are shown with
the risk-level indicators and the configuration and automation of the
underwriting, yielding the ERP (Enterprise Risk Profile) and the EIAU (Enterprise
Insurance Automated Underwriting) (Data Processing and Visualization).
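
The two steps described here, choosing the least recently updated target for a source and deriving where its JSON document is stored, can be sketched as below. This is only an illustration: the record shape (a `last_updated` mapping from source name to a timezone-aware timestamp) and the function names are hypothetical, and the actual upload call to the storage service is deliberately left out.

```python
from datetime import datetime, timezone
from typing import Optional

BUCKET = "big-data"  # container name taken from the text


def least_updated_target(targets: list[dict], source: str) -> Optional[dict]:
    """Pick the target whose data for `source` is oldest.

    Targets that were never fetched for this source sort first.
    """
    if not targets:
        return None
    never = datetime.min.replace(tzinfo=timezone.utc)
    return min(targets, key=lambda t: t.get("last_updated", {}).get(source, never))


def storage_key(company_id: str, user_id: str) -> str:
    """Build the object key: a folder named after the company identifier and
    a JSON document named after the user identifier."""
    return f"{company_id}/{user_id}.json"
```

Because the key is derived only from the two identifiers, a new fetch for the same user overwrites the previous document rather than accumulating copies; whether the pilot keeps a history instead is not stated in the text.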

It is important to note that the references to services provided by AWS are for
illustrative purposes. Provided that they have the same technical and
technological characteristics, these services can be replaced by those of another
supplier, which in this project may be NOVA and LeanXcale.


Logical Schema 


Figure 3.6 provides an INFINITECH-RA compliant logical view of the architecture
of the pilot.

The pilot's reference architecture and main data flows have been presented. In
summary, the main components to be developed in this pilot are:


* A Data Sources layer that, through information collection, selects and obtains
the information from dozens of sources in an efficient way, minimizing the
necessary computation.

* A Data Management layer that selects, captures and curates the data sources
required to implement the pilot's functionalities. This information management
allows the data collection sources to efficiently dump the data into
non-relational streaming databases.

* An Analytics block, fed by the data layers, where different ML/DL technologies
and visualization tools will enable data monitoring, analysis and exploitation.
Two main AI models will be developed to cover the pilot's use cases.


In Pilot #13 the main participants are Wenalyze and LeanXcale. LeanXcale will
provide the Data Management and Data Processing components, while Wenalyze will
provide the infrastructure for obtaining data from open sources, the Analytics
part, the User Interaction part and the Visualization part. The end users of the
information obtained and processed by the platform are insurance and reinsurance
companies and banks. Banks are concerned with the sale, underwriting and control
of the risks of their business clients and SMEs. First contacts with different
insurance companies have already been made. The comments have been very positive,
and the pilots would start once the algorithms and the platform are implemented.
In addition, a communication and conversion funnel has been created and is being
distributed in both European Union countries and the United States. At present,
16 insurance and banking entities from eight countries are in this funnel and are
moving from marketing qualified lead to sales qualified lead. Different proofs of
concept have already been agreed. The current conversion funnel for end users is
shown in Figure 3.7:

Target entities: commercial insurance companies, bancassurance, reinsurers.

* Lead (any type of contact point made): 6 insurance companies, 2 banks in
2 countries.
* Marketing Qualified Lead (several meetings, no NDA): 2 insurance companies,
1 reinsurer, 1 bank in 3 countries.
* Sales Qualified Lead (signed NDA): 4 insurance companies, 1 bank, 1 large
broker in 3 countries.
* In Budget (project included in their budget): 2 insurance companies, 1 bank in
2 countries.


Figure 3.7. Pilot #13 customer acquisition funnel.


The development of the pilot is progressing well, and it is expected to be
completed in the time foreseen by the consortium.


Components 


The following INFINITECH components will be used as part of this pilot:


* Data Acquisition Layer (Data Ingestion in RA): the data acquisition layer is
composed of micro-robots that roam the data sources. The deployment of the
micro-robots is discretionary, depending on speed and analysis needs, which makes
it flexible and scalable;

* INFINISTORE (HTAP data store and the polyglot engine) (Data Management in RA);

* Analytics Layer (Analytics and Machine Learning in RA);

* Connectivity layer through a REST API.


Conclusions - Issues and Barriers 


At this moment, Pilot #13 is progressing according to plan. Neither the intake of
information nor the construction of the models is presenting any problem.

The only delay with respect to the plan is the transfer of the AWS development
to the Nova testbed. For the time being, this is being managed by LeanXcale, and
the delay should not affect the deadlines set for the pilot.

Finally, it is worth pointing out the efforts on the commercial promotion of the
tool, starting with communication in sector forums in the European Union and
obtaining the first leads for the conversion funnel.


Pilot #14: Big Data and IoT for the Agricultural Insurance
Industry


The objective of Pilot #14 “Big Data and IoT for the Agricultural Insurance
Industry” is to deliver a commercial service module that will enable insurance
companies to exploit the untapped market potential of Agricultural Insurance
(AgI), taking advantage of innovations in Earth Observation (EO), weather
intelligence and ICT technology. EO will be used to develop the data products
that will act as a complementary source to the information used by insurance
companies to design their products and assess the risk of natural disasters.
Weather intelligence based on data assimilation, numerical weather prediction and
ensemble seasonal forecasting will be used to verify the occurrence of
catastrophic weather events and to predict future perils that could threaten the
portfolio of an agricultural insurance company. The indices derived by the
INFINITECH AgI module will enable the agricultural insurance industry to enlarge
its market, deliver a larger portfolio of products at lower cost and serve areas
where classical insurance products could not be delivered (Figure 3.8).

Also, the aim is to define, structure and pilot test specific services for the Agri- 
cultural Insurance sector in order to better protect agricultural assets by evaluating 
risks in a data-driven way and to improve the business process of agricultural insur- 
ers and clients (farmers). The services tested will be (1) a mapping of risks related 
to agriculture in predefined markets, (2) the prediction and assessment of weather 
and climate risk probability and (3) a damage assessment calculator for insurance 
companies. 


Technological components and Services 


Based on the reference architecture, the following components and services will
be deployed and used as part of the pilot:

ICT Modules


* Octopush EO Service: Octopush EO Service is an integrated satellite-derived
software service that collects earth observation, geospatial, in-situ and other
geo-referenced data, applies appropriate processing algorithms and returns the
results in a ready-to-use format.

* AgroApps Weather Intelligence Engine (AgroApps WIE): The WIE is an integrated
weather-derived software service that collects weather information from several
resources and, along with the geo-referenced data, applies appropriate processing
algorithms and returns the results in a ready-to-use format.

* Data Integrator: The Data Integrator acts as a bridge between the WebGIS
subsystem, the Octopush EO Service and the WIE. It is responsible for performing
the essential scheduled calls to the data providers in order to fetch and process
the desired EO and weather information. It is able to run calls on demand or
daily data integration tasks by retrieving EO data and weather products from the
Octopush EO Service and the WIE; it then transforms, binds and injects them into
the WebGIS database.

* Business and Geospatial DB: The Business DB offers a storage layer essential to
carry the business logic and the relevant information/data stored and managed by
the API. It also stores, retrieves and provides information related to user
accounts, settings, actions and preferences. The geospatial data storage and data
persistence mechanisms allow the storage of the geometries and zonal statistics
and provide the essential functionality for querying and retrieving data via an
API or WMS server components.

* Web Map Server (WMS Server): The WMS is responsible for rendering and serving
the GIS layers to the User Interface.

* RESTful API: The API will act as a communication and data exchange bridge that
allows the platform to share processed and structured content internally between
the different components.

* User interface: The front-end user interface is the gateway responsible for
presenting all the system data through user-friendly controls and web mapping
interfaces.
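
The fetch, transform, bind and inject cycle attributed to the Data Integrator in the list above might be organised along the following lines. This is a minimal sketch under stated assumptions: the provider callables, the record fields (`product`, `value`) and the `dict` standing in for the WebGIS database are all hypothetical, and no real Octopush or WIE interface is shown.

```python
from typing import Callable, Iterable


def integrate(
    fetchers: Iterable[Callable[[str], list]],
    aoi_id: str,
    database: dict,
) -> int:
    """Fetch products for one AOI from each provider and inject them into `database`.

    `database` stands in for the WebGIS database: it maps (aoi_id, product)
    to the latest value. Returns the number of records injected.
    """
    injected = 0
    for fetch in fetchers:
        for raw in fetch(aoi_id):
            # Transform: normalise the value; bind: attach the record to its AOI.
            record = {
                "aoi_id": aoi_id,
                "product": raw["product"],
                "value": float(raw["value"]),
            }
            # Inject: a later run overwrites the earlier value for the same product.
            database[(record["aoi_id"], record["product"])] = record["value"]
            injected += 1
    return injected
```

The same function can serve both modes mentioned in the text: it can be called directly for an on-demand run, or invoked once a day by whatever scheduler the deployment uses.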


Testbed 


All modules of the Pilot #14 services will be hosted on AgroApps premises, except
the weather intelligence engine, which will be deployed in UNINOVA's
infrastructure. Server specifications will be defined at a later stage.


Other non-technical requirements


Besides the technical requirements for the pilot, there are also non-technical
requirements that must be met in order to test the application successfully.
These requirements mainly relate to the data provided by the agricultural
insurance company:


* In the past, we observed that the quality of the data provided by agricultural
insurers was often poor. This is mainly due to the IT structure of the insurers,
which often does not allow targeted queries at short notice.

* However, in order to apply the structured and unstructured data provided in
the shared testbed by AgroApps and UNINOVA (Earth Observation (EO), Numerical
Weather Prediction Data, Reanalysis Data and Seasonal Climate Forecasts) to the
insured or to-be-insured regions, the data provided by the insurance company need
to be accurate, timely and at a matching spatial resolution.

* This applies not only to the clear identification of a region/field by
coordinates or IDs from national databases, but also to the existing/desired form
of insurance cover, the crop to be insured, average yield values and a
(potential) loss history.

* If the data quality is insufficient, the national statistics office could also
be consulted, e.g. for average yield data.


Furthermore, Pilot #14 makes use of the respective national network of weather
stations to collect data used in the Weather Intelligence Engine to predict
weather and climate patterns.


Implementation of a first Proof of Concept 


The first Pilot #14 Proof of Concept (PoC) will focus on a data processing architec- 
ture and a data analytics infrastructure to create an Area Risk Profile for the defined 
Area of Interest (AOI) in order to assess the risk of natural disasters and to develop 
a pricing framework for a drought index product. 

Therefore, EO data derived from satellites and weather intelligence based on 
data assimilation, numerical weather prediction and ensemble seasonal forecasting 
will be used to verify the occurrence of catastrophic weather events and to predict 
future perils that could threaten the portfolio of an agricultural insurance company 
(Figure 3.9). 

In this first phase of the pilot site preparation, the focus will be on providing
solutions for users in agricultural insurance companies (Actuaries, Underwriters,
Sales Agents, Loss Adjusters), as described in User Stories #14.01-14.08.

By combining the components developed in the AgroApps and UNINOVA 
Infrastructure and the data set from the insurance company, the respective user 
application can be set up and tested. The results of this first PoC will help to improve 
the data flow and data analytics processes for the Pilot’s final services (Figure 3.10). 


Expected Outcomes 


* Identification of areas within the large-scale pilots where crop productivity 
and catastrophe probability are high based on intelligent risk mapping. 

* Creation of additional datasets with high predictive value for improving 
underwriting of agricultural risks with regard to weather and climate risk 
probability. 

* Improved damage assessment and claims handling procedures for the insur- 
ance industry to increase the efficiency of calculating indemnity pay-outs. 
[Figure 3.10 content: Insured value: index value derived from the historical
average (high correlation with field variable(s), e.g. weather or yield data).
Client risk retention: % of historical average (area above trigger). Payout
limit: maximum payout as % of historical average (area between trigger and exit).
Together these form the framework for index insurance pricing.]


Figure 3.10. Pilot #14 Data requirements for developing a pricing framework.
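

One common way to read the trigger and exit terms in Figure 3.10 is as a linear payout schedule: nothing is paid while the index stays at or above the trigger, and the payout grows until it reaches its limit at the exit. The sketch below follows that reading. It is an assumption about the figure rather than the pilot's actual pricing framework, and the function name and parameters are hypothetical.

```python
def index_payout(
    index_pct: float,
    trigger_pct: float,
    exit_pct: float,
    insured_value: float,
) -> float:
    """Payout for an index expressed as a percentage of its historical average.

    `insured_value` is the monetary value attached to 100% of the historical
    average. The client retains the loss above the trigger; the payout is
    capped at the band between trigger and exit.
    """
    if not exit_pct < trigger_pct:
        raise ValueError("exit_pct must be below trigger_pct")
    # Clamp the observed index into the [exit, trigger] band.
    clamped = min(trigger_pct, max(exit_pct, index_pct))
    return insured_value * (trigger_pct - clamped) / 100
```

For example, with an insured value of 1000, a trigger at 80% and an exit at 50%, an index of 65% falls 15 points below the trigger and pays 150, while any index at or below 50% pays the limit of 300.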


Figure 3.11 shows the workflow for configurable and personalized insurance 
products for SMEs and agro-insurance. 


Datasets 


The main data for the pilot are produced by satellites and a weather intelligence
engine. The Earth Observation (EO) data will be derived from the satellites
Sentinel-1, 2 and 3, Landsat-8, MODIS and PROBA-V. In addition, numerical weather
predictions for the pilot areas (gridded data) are generated each day and replace
the previous prediction. Lastly, gridded climate indices based on ERA-5 Land and
ERA-5 Reanalysis Data will be used for the pilot.

The list of datasets is presented below (GEN):


* Gridded Climate Indices (1/1/1979 to 31/12/2019): Climate Indices based
on the ERA-5 Land and ERA-5 Reanalysis Data.

* EO Data: Earth Observation Data (Sentinel-1,2,3/Landsat-8, MODIS,
PROBA-V) for remote damage and crop loss assessment.

* Numerical Weather Predictions: Very High-Resolution Weather Predictions
for the Pilot Areas.

* Temperature, precipitation, evapotranspiration, soil moisture, crop growth
data, crop water requirements, wind speed, relative humidity, solar radiation,
snow cover, snow depth, etc.

* Hail data. 

* Loss history. 


Data Produced 


The data produced will result in a solution for agricultural insurance companies
that allows them to efficiently couple EO satellite data and weather/climate data
with any type of complementary data (from separate drone shots to
ultra-high-resolution SAR imagery). The INFINITECH AgI module will enable
insurance companies to alleviate the effect of weather uncertainty when
estimating risk for AgI products, reduce the number of on-site visits for claim
verification, reduce operational and administrative costs for monitoring insured
indexes and handling contracts, and design more accurate and personalized
contracts. By deriving impartial indices on top of a multitude of data, the
module will allow insurers to significantly reduce the time needed for handling
and verifying claims and the costs imposed by fraud, moral hazard and adverse
selection.


Figure 3.11. Configurable and personalized insurance product for SMEs and
agro-insurance.


Explainable Workflow 


The data produced by the Octopush EO Service (Crop Monitoring, Pest & Disease
Services, Damage Assessment Services) and the AgroApps Weather Intelligence
Service (Weather Forecast Services, Climate Services) are pipelined into a Data
Integrator. The integrator feeds back metadata and information on the Area of
Interest (AOI) to the data-producing services. The integrator itself retrieves
the AOI and metadata from a Business DB storage layer. The geospatial data
storage and data persistence mechanisms allow the storage of the geometries and
zonal statistics and provide the essential functionality for querying and
retrieving data via an API (alerts) or WMS server components (vector source). The
WMS server is then responsible for rendering and serving the GIS layers to the
User Interface. The RESTful API acts as a communication and data exchange bridge
that allows the platform to share processed and structured content internally
between the different components. The front-end user interface is the gateway
responsible for presenting all the system data through user-friendly controls and
web mapping interfaces.

Figure 3.12 depicts the logical schema of the Big Data and IoT for the
agricultural insurance industry pilot pipeline, in line with the IRA.


Figure 3.12. Big Data and IoT for agricultural insurance industry pilot pipeline
in line with the IRA.


Logical Schema 


AgroApps is developing the entire infrastructure for the Pilot #14 data products,
based on the reference architecture, from data collection from different sources
through processing and analytics to the user interface and data visualization.
The ongoing development of the service module is based on scientific research in
the fields of agricultural insurance and climate and weather risk modelling, and
on the most recent evolutions in remote sensing technologies. These three areas
will play a crucial role in the future of agricultural insurance providers,
helping them tap new markets, provide better risk transfer solutions and make
insight-based strategic decisions. To meet the demands of this rapidly evolving
field, it is necessary to follow these current developments.

As described in the User Stories, the service module is mainly designed for staff
working in the underwriting and sales departments of agricultural insurance
companies (the majority of User Stories serve this group of end users). However,
within these departments, there are several roles that can benefit from the
services provided by Pilot #14. First of all, Actuaries (business
professionals/mathematicians who analyse the financial consequences of risk by
using statistics) are able to improve their data set for risk pricing and product
development based on the data retrieved from the service module. Based on this
information, Underwriters can better evaluate the risk and exposure of potential
clients (crop monitoring) and hence make the overall insurance portfolio more
resilient while at the same time increasing the outreach to clients (farmers).
Additionally, Sales Agents can identify areas in which to prioritize sales
activities without increasing the cumulative risk, since they are aware of, e.g.,
regional risk profiles.

Lastly, with the support of data derived from the Octopush EO (damage assess- 
ment services), loss adjusters have additional information to make the on-farm 
process of loss adjusting more efficient and for certain perils conduct this process 
remotely via the service module (without visiting the farm/respective field). 

In addition to the implementation within insurance companies, at a later stage 
of the project other users in the insurance value chain can also be considered as end 
users. 

A first contact inside an insurance company in the Area of Interest (Serbia) has
been made and immediately generated interest because of the benefits the Pilot
#14 service module has to offer. The feedback on the presented services was very
positive; only a final decision by the management is outstanding.

In this very first stage of the preparation of the pilot site, receiving an
appropriate, high-quality dataset from the pilot insurance company and applying
the services described in Pilot #14 have the highest priority. For this reason,
no training plans for deployment to the final user have been developed yet.

However, for the internal deployment at the final pilot site, Pilot #14 can provide 
an independent web-based user interface for the end users to access the service 
module via their browser. 


Components 


The pilot comprises ICT modules and services for the insurance sector. 


* ICT Modules:

  o Octopush EO Service (Data Source in RA): Octopush EO Service is an
    integrated satellite-derived software service that collects earth
    observation, geospatial, in-situ and other geo-referenced data. It applies
    appropriate processing algorithms and returns the results in a ready-to-use
    format.

  o AgroApps Weather Intelligence Engine (AgroApps WIE) (Data Source in RA):
    The WIE is an integrated weather-derived software service that collects
    weather information from several resources and, along with the
    geo-referenced data, applies appropriate processing algorithms and returns
    the results in a ready-to-use format.

  o Data Integrator (Data Ingestion in RA): The Data Integrator acts as a
    bridge between the WebGIS subsystem, the Octopush EO Service and the WIE.
    It is responsible for performing the essential scheduled calls to the data
    providers in order to fetch and process the desired EO and weather
    information. It is able to run calls on demand or daily data integration
    tasks by retrieving EO data and weather products from the Octopush EO
    Service and the WIE; it then transforms, binds and injects them into the
    WebGIS database.

  o Business and Geospatial DB (Data Management in RA): The Business DB offers
    a storage layer essential to carry the business logic and the relevant
    information/data stored and managed by the API. It also stores, retrieves
    and provides information related to user accounts, settings, actions and
    preferences. The geospatial data storage and data persistence mechanisms
    allow the storage of the geometries and zonal statistics and provide the
    essential functionality for querying and retrieving data via an API or WMS
    server components.

  o Web Map Server (WMS Server) (Analytics and Machine Learning in RA for
    Geoserver and Interface for Apache Tomcat and RESTful API): The WMS is
    responsible for rendering and serving the GIS layers to the User Interface.

  o RESTful API (Interface in RA): The API will act as a communication and
    data exchange bridge that allows the platform to share processed and
    structured content internally between the different components.

  o User interface (Interface in RA): The front-end user interface is the
    gateway responsible for presenting all the system data through
    user-friendly controls and web mapping interfaces.


* Services for the Insurance Sector:

  o Remote Damage Assessment for drought and hail
  o Flood and wildfires mapping
  o Short and medium range weather forecasts
  o Seasonal Climate Forecasts of Agroclimatic Indicators
  o Climate Risk Assessment
Conclusions - Issues and Barriers


Based on the work done so far, we are aware of the following challenges: 


* Provision of UNINOVA IT infrastructure to run the testbed as foreseen in 
WP6 to enable enough computing capabilities to run the Weather Intelli- 
gence Engine. 

* Receiving appropriate and high-quality dataset from insurance company for 
the PoC and ongoing activities. 

* Identifying the right correlations between the data provided via the testbed
and the dataset for the respective AOI in order to draw the right conclusions
for the area risk profile and hence the insurance pricing for a drought index
insurance product.


To conclude on the status of the pilot site preparation, the partners involved in
Pilot #14 (AGRO, GEN) are in close contact with Nova as the testbed provider and
are awaiting notification of a successful set-up of the shared testbed
infrastructure in the coming weeks.

Furthermore, a good relationship was established with two agricultural insurance
companies that would be able to provide the insurance company data required for
this pilot for the defined AOI.

Both insurance companies approached are composite insurance companies, and
hence do not focus only on agricultural insurance. In most markets, the
Agricultural Line of Business (LOB) is not the most profitable one for
insurance companies. On the one hand, the service module developed in Pilot #14
will contribute to exploiting untapped market potential and new, innovative
business and product opportunities; on the other hand, it is difficult to
convince the Management and the Underwriting Departments of all the benefits.

Therefore, GEN is using its business relationships to directly talk to potential 
decision makers. To convince those decision makers, GEN has pitched the overall 
goal of the INFINITECH project together with the objectives of Pilot #14, the 
structure of the pilot in general terms, data requirements and lastly the benefits in 
the short, medium, and long term for the pilot user (as defined in the user stories 
for agricultural insurance companies). 

As a next step, decision makers will be given time for feedback and questions.
Afterwards, a meeting together with the Tech-Proxy of Pilot #14 (AGRO) will be
organized to dive deeper into the set-up of the service module, the capabilities
of the module to provide additional data, and the data requirements to be
derived from the insurance company.

This process is essential for reaching out to potential pilot users in order
to test and evaluate the added value of the service module (based on defined
user requirements) developed for the specific business processes in agricultural
insurance.


3.2 Personalized Retail and Investment Banking 
Services 


Pilot #3: Collaborative Customer-centric Data Analytics for 
Financial Services 


This pilot examines how banks and FinTechs, in collaboration with research
organisations and NGOs, can develop an AI-driven capability that uses the
transactional data generated by financial activities to identify money-related
profiles. Data profiles, e.g. from social media, can then be associated with
human profiles based on their financial activity. These profiles will be built
into the available AI engine and combined with existing technology and data
sourced from the TAH human trafficking platform. The results will produce a
complete picture of people profiles, people trafficking routes and the
corresponding money flows back to the criminal organizations.

This pilot will utilize a combination of open-banking, social and internal-bank
generated data sources to establish a high-volume and high-quality view of the
customer, to be used for a range of data analytics performed on big data
platforms (Figure 3.13). The analytical methods could include link analysis to
support permission-based customer relationship analytics on behalf of customer
and bank, or transaction monitoring to support credit risk management for the
bank while also providing value for customers.

Pilot #3 will need to simulate a data sharing ecosystem by mimicking
participants in that ecosystem, providing rules of engagement and highlighting
the value exchanges between participants. A digital ecosystem framework is
described here to articulate the testbed components required.

Figure 3.13. Pilot #3 workflow.


Expected Outcomes 


The pilot will produce three data-intensive systems: a KYC system based on
data sharing, a credit scoring system, and an AML system operating on semantic
technologies and blockchain-based data sharing. The pilot evaluation will
consider KPIs associated with the speed of the processes (i.e. KYC), customer
satisfaction and customer engagement in data sharing. The workflow is described
in Figure 3.9.


Datasets 


* Customer and Account Data (Bank Data).
* Customer to Customer Relationships (Bank Data).
* Customer Account Data (Open Banking Data).
* Other Open Customer Data (Social etc.).


Pilot #3 will consider two sources of data:


* Operational Data Sources: we will not use existing BOI operational data
sources, because of confidentiality issues (even if the data were anonymized)
and data consistency issues. Instead, the proof-of-concept data sources are
‘synthetic’ customer, account and transaction data designed to mimic real-world
data scenarios from financial services.

* Data captured from data entry in the application, including consent or
metadata exhaust from the sharing process.
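As an illustration of what such synthetic proof-of-concept data could look like, the following minimal Python sketch generates linked customer, account and transaction records. The record fields, identifier formats, segments and value ranges are assumptions made for this example only; they are not the pilot's actual data model.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Customer:
    customer_id: str
    segment: str


@dataclass
class Account:
    account_id: str
    customer_id: str


@dataclass
class Transaction:
    account_id: str
    booked_on: date
    amount: float  # negative = debit, positive = credit


def make_synthetic_dataset(n_customers, tx_per_account, seed=0):
    """Generate linked synthetic customers, accounts and transactions."""
    rng = random.Random(seed)  # a fixed seed keeps the dataset reproducible
    customers, accounts, transactions = [], [], []
    start = date(2020, 1, 1)  # arbitrary reference date for the example
    for i in range(n_customers):
        cust = Customer(f"C{i:05d}", rng.choice(["retail", "sme"]))
        customers.append(cust)
        # every customer gets one or two accounts
        for j in range(rng.randint(1, 2)):
            acc = Account(f"{cust.customer_id}-A{j}", cust.customer_id)
            accounts.append(acc)
            for _ in range(tx_per_account):
                transactions.append(Transaction(
                    account_id=acc.account_id,
                    booked_on=start + timedelta(days=rng.randrange(365)),
                    amount=round(rng.uniform(-500.0, 500.0), 2),
                ))
    return customers, accounts, transactions
```

Because no real customer is involved, a dataset of this kind can be shared between ecosystem participants without the confidentiality constraints that apply to operational data.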


Data Produced 


For Pilot #3, a better representation of the data lifecycle might be as follows:

* Data utilized/transferred (e.g. data sharing payload: Customer/KYC, Account
or Transaction Data),

* Data transformation (e.g. any data changes),

* Data produced (e.g. new data), and

* Data deletion (e.g. revoked consent), etc.
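The lifecycle stages listed above can be sketched as a small event log attached to each shared record. This is only an illustrative sketch: the class names, fields and the rule that revoking consent clears the payload are assumptions, not the pilot's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class LifecycleEvent(Enum):
    TRANSFERRED = "data utilized/transferred"
    TRANSFORMED = "data transformation"
    PRODUCED = "data produced"
    DELETED = "data deletion"


@dataclass
class SharedRecord:
    record_id: str
    payload: dict
    consent_active: bool = True
    history: list = field(default_factory=list)

    def log(self, event):
        """Append a lifecycle event to the record's audit trail."""
        self.history.append(event)

    def revoke_consent(self):
        """Revoking consent clears the payload and records a deletion."""
        self.consent_active = False
        self.payload = {}
        self.log(LifecycleEvent.DELETED)


# Hypothetical usage: a KYC payload is shared, then consent is revoked.
record = SharedRecord("kyc-001", {"name": "Jane Doe"})
record.log(LifecycleEvent.TRANSFERRED)
record.revoke_consent()
```

Keeping the history on the record itself makes the consent trail easy to audit, at the cost of storing metadata alongside the data it describes.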


Explainable Workflow 


The whole premise of the proposed pilot is to enable unlimited use cases between
any participants via a single application, creating a single ecosystem. Specific
back-end data services might be built to support a particular use case, e.g.
KYC. Below, Figures 3.14 and 3.15 illustrate the KYC use case in terms of
business process and data flow and technical data flow, respectively.

Centralised ‘KYC Data’ Data Sharing Service


Figure 3.14. Customer-Centric data analytics pilot workflow - KYC data sharing process - 
business workflow. 


Figure 3.15. Customer-Centric data analytics pilot workflow - KYC data sharing process -
technical workflow.


Logical Schema


The following figure refactors the components of the above-listed workflows
to illustrate the pilot logical architecture in line with the INFINITECH-RA.


Pilot #4: Personalised Portfolio Management - Mechanism for
AI Based Portfolio Construction


The main goal is to develop and adapt, within the SaaS-based Privé Managers
Wealth Management Platform, a Portfolio Optimization algorithm (hereafter called
Privé Optimizer or "AIGO"), as well as to improve and expand its capabilities as
an artificial intelligence engine to support better investment propositions for
retail clients.

This pilot will explore the possibilities of AI-based portfolio construction for
Wealth Management processes, regardless of the amount to be invested (hence
the slogan "Private Banking could be for everyone"). The AI-Based Portfolio
Construction will enable advisors and/or end-customers to use the existing Wealth
Management Platform "Privé Managers" and make use of its risk-profiling and
investment proposal capabilities, starting from their personal risk-awareness
(Figure 3.16). AIGO allows for a variety of use cases which cater to the needs
of financial advisors, end-clients and financial services companies. The
innovative AIGO genetic algorithm can be used for proposing investments and
evaluating them against an easy-to-use, personalizable set of criteria, in the
form of so-called fitness factors (Figure 3.17). These fitness factors will be
used to generate “health” scores for portfolios, which are used to define the
“fittest” investments.
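To make the idea of fitness factors driving a genetic search more concrete, here is a minimal, generic sketch in Python. It is not the AIGO algorithm: the two fitness factors, the `exp_return` asset field, the importances and all tuning parameters are hypothetical choices made purely for illustration.

```python
import random


def normalise(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def return_factor(weights, assets):
    """Hypothetical fitness factor: weighted expected return."""
    return sum(w * a["exp_return"] for w, a in zip(weights, assets))


def diversification_factor(weights, assets):
    """Hypothetical fitness factor: 1 minus the Herfindahl concentration index."""
    return 1.0 - sum(w * w for w in weights)


def total_fitness(weights, assets, factors):
    """Combine the selected fitness factors using user-chosen importances."""
    return sum(importance * factor(weights, assets) for factor, importance in factors)


def optimise(assets, factors, generations=200, pop_size=40, seed=0):
    """Evolve portfolio weight vectors towards a higher total fitness."""
    rng = random.Random(seed)
    n = len(assets)
    population = [normalise([rng.random() + 1e-9 for _ in range(n)])
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda w: total_fitness(w, assets, factors), reverse=True)
        survivors = population[: pop_size // 2]  # selection: the fitter half is kept
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]  # crossover: blend two parents
            i = rng.randrange(n)
            # mutation: nudge one weight, keeping it strictly positive
            child[i] = max(child[i] + rng.uniform(-0.1, 0.1), 0.0) + 1e-9
            children.append(normalise(child))
        population = survivors + children
    return max(population, key=lambda w: total_fitness(w, assets, factors))


# Hypothetical usage with three assets and two weighted fitness factors.
assets = [{"exp_return": 0.02}, {"exp_return": 0.08}, {"exp_return": 0.05}]
factors = [(return_factor, 1.0), (diversification_factor, 0.1)]
proposal = optimise(assets, factors)
```

Because the fitter half of each generation survives unchanged, the best total fitness never decreases from one generation to the next; changing the list of factors or their importances is enough to personalise the search.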


Figure 3.16. Customer-centric data analytics pilot pipeline in-line with IRA.

Figure 3.17. Pilot #4 roles and services.


Starting from a client's cash pool or current investments/portfolios, the user will
select the fitness factors and constraints or preferences to perform the portfolio
construction, based on the client's risk profile and preferences. The optimisation
tool to be developed in the Pilot will run on a pre-set universe of assets, taking
into account all the input data and constraints. The AI genetic algorithm will
generate a new proposal in which the selected preferences and risk parameters
are reflected. The optimisation tool can be run multiple times, after the
necessary changes in the initial parameters are made. In this context, the main
innovation of the pilot lies in the applicability of AI technologies to build
customized portfolios (Private Banking for everyone).


Technological components and Services 


The High-Level Architecture in Figure 3.18 shows the software components
that build the Pilot's use cases. This figure has been used in D6.1 Testbeds
Status and Upgrades to identify hardware requirements, and in D2.5 Specifications
of INFINITECH Technologies to describe the technologies behind its principal
components.

This book series links the shown software components with the corresponding 
Reference Architecture layers, providing some details about their implementation. 
In this sense: 


* Data Collection (Data Management layer of the RA) based on the customer's cash
pool or current investments/portfolios data.

* Customers & investments/portfolio Data quality check (Data Processing
layer in RA): according to specific data models, in order to perform the data
preparation for portfolio construction, based on the client's risk profile and
preferences.

* AI Based Portfolio Optimization Process (AIGO): to be developed in the Pilot
based on an AI algorithm, it will run on a pre-set universe of assets, taking
into account all the input data and constraints, and generate a new proposal
that reflects the selected preferences and risk parameters for a specific
customer.

* New proposal for the personalized portfolio: it will be visualized through a
PDF report or a JSON extract that can be imported into any relevant portfolio
management tool.

Figure 3.18. Pilot #4 high-level architecture.


Testbed 


Privé will host its Testbed on its own cloud in AWS, with the architectural
setting indicated below.

Technical Specifications from Privé's internal Testbed

Hardware Specifications (in case of Cloud Installation, include the relative cloud
configuration): 3 VM instances, each with the following:

* CPU: Intel Xeon 3 GHz or faster
* Core: minimum 2 cores, 4 threads
* Memory: 32 GB DDR4 1600 or 1866
* Hard Disk: 16 GB SSD


Technical Architecture Diagram 


Software Specifications including for each module the relative software stack that
is used (e.g. Operating System, Layered Software, Application Software,
Development Platform, Database, etc.).


Operating System/Application Software/Layered Software


The SaaS platform runs in multiple data centers with an active-active setup to
achieve high availability. Privé has the following environments: DEV, SIT, UAT
and PROD. Data can be transferred via SFTP, FIX or API. Most Privé APIs are REST,
but SOAP and GraphQL are also supported. The architecture is based on
microservices.

* Operating System: Ubuntu 18.04 LTS
* Framework: Spring Boot 2.2
* Application Server: Tomcat 7.0.103
* Database: MySQL 5.6.47
* Database: MongoDB 3.6
* Language Runtime: Java (OpenJDK 8u242)


Development Platform (Figure 3.19) 


We use HTML5/ReactJS for the frontend. Our platform is written in Java, with
Spring MVC and Spring Boot, and hosted with Apache Tomcat.

Note that the hardware specifications (e.g. RAM, number of CPUs, etc.) will
need to be defined based on the requirements of the relative technology
solutions (e.g. Data Management & Processing, Analytics & AI, etc.) that will
be used for each Testbed and relative sandboxes, in cooperation with the relevant
Technical Partners.

The Privé testbed is ready to be used and the tests conducted on our own AWS
cloud were successful. The proof of concept will be delivered and presented
showing the current API functionalities and results stored in this testbed.


Implementation of a first Proof of Concept 


As a minimum viable product, or rather a first Proof of Concept, Privé will
present the back-end/calculation capability of AIGO (Artificial Intelligence
Portfolio Construction Optimizer) via API calls. This will consist of the
optimization process presented for an example portfolio via a couple of
so-called fitness factors, which allow a pre-given portfolio to be optimized
within a pre-determined investment universe. For the first proof of concept,
not all services or data sets described in the user stories will be
implemented. The figure below highlights the PoC main components (Figure 3.20).

Figure 3.19. Pilot #4 components.

Figure 3.20. Portfolio optimization procedure.

In this case, the input will consist of an investment universe of 50 European
stocks and a pre-defined example portfolio, and the output will be an optimized
portfolio based on the selected user preferences (i.e. the fitness-factor
functions for the optimizer). Both data sets and testbed infrastructure have
been described in more detail above. All the inputs and outputs will be callable
via API.


Expected Outcomes 


* The AI Based Portfolio Construction shall enable interested advisors or
end-customers, after an initial "customer onboarding" (KYC, Risk Profiling), to
upload relevant personal portfolios and start a portfolio optimisation process,
in which the AI-based portfolio construction is run together with a
"genetic portfolio optimisation methodology". In several steps of portfolio
calculations, the “fittest” portfolio construction, based on risk appetite and
defined risk limitations, shall be identified.


The following figure illustrates the portfolio optimization procedure in Pilot
#4. The first step is gathering client needs through constructive conversation
with advisors. Then, analytical tools are applied to analyze data from input
and logic. Last, the portfolio is optimized in 5 seconds.


Datasets 


The data to be used by this pilot will be: 


* Customer Transactions Data: customer securities and cash transactions
through their deposit accounts. They are fetched directly from the Bank or
an Asset Manager;

* Financial Market Price Data: price data for Stocks, Bonds, Mutual Funds
and/or other assets like certificates/warrants. They are fetched from several
Market Data Providers;

* Financial Market Asset Master Data: asset related characteristics (e.g.
expiration date, minimum investment amount, asset class breakdowns). They are
fetched from several Market Data Providers;

* Customer Risk Profile Data: customer risk profile data obtained through their
account data and profiling, based on B2B customers' parameters. They are
fetched directly from the Bank or an Asset Manager;

* Mutual Fund, ETF and Structured Products Breakdown: asset breakdowns
based on bank data or market data providers' breakdowns. They are fetched
from several Market Data Providers;

* Customer Economic Outlook: they are fetched directly from the Bank or an
Asset Manager based on questionnaires and Customer Profiles;

* Single Account & Investors Data: 19,484 accounts for about 15,400 investors
(live data); 94,407 different securities available; investors serviced by 309
different advisor companies; accounts in 28 different custodian banks (data
from 2019).


All datasets will be stored within the Privé SaaS solution in a cloud setup.
Asset data and Client data are fetched from 3rd party databases and partially
from selected market-data providers. Risk metrics are calculated in the
historical backtesting component for each single portfolio. A Genetic Algorithm
component evaluates different Fitness Factors and generates a customised
portfolio proposal.


Data Produced 


JSON files will be produced by the Privé API (for other 3rd party solutions that
address this Portfolio Optimisation functionality), and PDF files can be
generated for UI display and customer documentation.

The output data consists of the single portfolio holdings, with their weights
and amounts, on which the decision about the Proposed Portfolio is based. Fitness
Factor Scores and the Total Fitness Score will be output for both the current
and the proposed (optimised) portfolio. For both portfolios, Risk and Return
metrics will also be shown: 5-year annualized return, volatility and Sharpe
ratio.
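For readers unfamiliar with these three metrics, the following sketch shows one common way of computing them from a price series. The function name, the default of 252 trading periods per year and the zero default risk-free rate are assumptions of this example; the text does not state how the Privé platform computes its figures.

```python
import math
import statistics


def annualised_metrics(prices, periods_per_year=252, risk_free_rate=0.0):
    """Return (annualised return, annualised volatility, Sharpe ratio)."""
    # simple period-on-period returns
    returns = [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]
    years = len(returns) / periods_per_year
    # geometric annualised return over the whole series
    ann_return = (prices[-1] / prices[0]) ** (1.0 / years) - 1.0
    # sample standard deviation of period returns, scaled to one year
    ann_vol = statistics.stdev(returns) * math.sqrt(periods_per_year)
    sharpe = (ann_return - risk_free_rate) / ann_vol if ann_vol > 0 else float("nan")
    return ann_return, ann_vol, sharpe
```

Applied to both the current and the proposed portfolio value series, the same function yields directly comparable figures for the two portfolios.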


Explainable Workflow 


Starting from a client's cash pool or current investments/portfolios, a risk profile is
created or an existing one is updated (Steps 1 to 3 in Figure 3.21). Then the user
will select the fitness factors and constraints or preferences to perform the portfolio
construction, based on the client's risk profile and preferences (Step 4). The
optimisation tool will run on a pre-set universe of assets, taking into account all
the input data and constraints (Steps 5 to 7). The AI genetic algorithm will
generate a new proposal in which the selected preferences and risk parameters
are reflected (Steps 8 and 10). The optimisation tool can be run multiple times,
after the necessary changes in the initial parameters are made, depending on
whether the proposed portfolio is satisfactory or not (Step 9). This process can
result in a UI proposal or a PDF-generated investment proposal.

Figure 3.21. Personalized portfolio management pilot workflow.

Both inputs and outputs will be stored in Privé's own cloud. AI fitness functions
within AIGO will be callable via API based on the initial user preference inputs.
All the datasets, for both inputs and outputs of the algorithm, will also be
stored on Privé's side.


Logical Schema 


An initial mapping of the explainable workflow of the pilot to the INFINITECH-
RA layers and constructs is depicted in Figure 3.22.
The Pilot’s Reference Architecture can be simplified by considering:


* A Data Management layer that performs data ingestion based on cash pool
or current investments/portfolios data, plus quality checking and harmonization
of the data provided, so that it can be imported into the datastore for use by
the pilot’s functionalities.

* A Data Processing layer in charge of homogenising and storing all data
collected, according to specific data models, to perform the data preparation
for portfolio construction, based on the client's risk profile and preferences.

* An Analytics layer, based on the AIGO optimisation tool to be developed in
the Pilot, which will run on a pre-set universe of assets considering all the
input data and constraints. The AI genetic algorithm will generate a new
proposal that reflects the selected preferences and risk parameters, based on
the available data provided by the customer and the relative
investments/portfolios.

* Finally, a visualization layer will provide the proposed portfolio suitable for
the specific customer through a report in PDF format or a JSON response.


Privé's external stakeholder regarding AIGO is currently Reportbrain. Privé will
provide the technology for the optimization process. On top of that, Reportbrain
will support Privé with its own specific dataset. In that way, the development
will be carried out by Privé with the support of Reportbrain. The end users will
be advisors, asset managers, insurance companies, banks, family offices or their
end-users/clients.


Data Components 


This section links the software components with the corresponding Reference 
Architecture layers, providing some details about their implementation. In this 
sense: 


* Data Collection (Data Management layer of the RA) based on the customer's cash
pool or current investments/portfolios data.

* Customers & investments/portfolio Data quality check (Data Processing
layer in RA): according to specific data models, in order to perform the data
preparation for portfolio construction, based on the client's risk profile and
preferences.

* AI Based Portfolio Optimization Process (AIGO, Analytics layer in the RA):
to be developed in the Pilot based on an AI algorithm, it will run on a pre-set
universe of assets, taking into account all the input data and constraints, and
generate a new proposal that reflects the selected preferences and risk
parameters for a specific customer.

* New proposal for the personalized portfolio: it will be visualized through a
PDF report or a JSON extract that can be imported into any relevant portfolio
management tool.


Conclusions - Issues and Barriers 


After development started, Privé successfully finished implementing a first Proof
of Concept in a UAT environment hosted on our own AWS Cloud. The pilot testbed
is already set up and available via SaaS access.

The main challenges concerned the Market Data Availability Setup on our UAT
environment, as the relative datasets will need to be enhanced either with more
customer portfolio data or with a greater variety of financial instruments data
from various sources; both affect the fitness factors and constraints used to
construct better proposed portfolios.

Similar challenges could arise in the future as other investment universes or
market providers are made available for the optimization process.

Also, the integration of a new fitness factor based on the Reportbrain Market
Sentiment Factor, supplied by an external provider via API, could bring up some
challenges too, as it will require further exploration of the AIGO optimisation
tool's capabilities in order to provide better results for personalized proposed
portfolios that also take the sentiment analysis factor into account.

In general, the outcome of this pilot will be to develop and adapt, within the
SaaS-based Privé Managers Wealth Management Platform, the Portfolio Optimization
algorithm AIGO (or Privé Optimizer), as well as to improve and expand its
capabilities as an artificial intelligence engine to support better investment
propositions for retail clients, which can be used as a SaaS service through an
API by other interested parties (investment firms, private banks, wealth
management firms, etc.).


Pilot #5A: Smart and Personalized Pocket Assistant for
Personal Financial Management


This pilot will build a personal pocket assistant for clients of the bank, based on
the development of AI-enabled personal financial management (PFM) software.
The assistant will process large amounts of data concerning the full range of an
individual’s or an enterprise’s interaction with the bank, using a variety of
analytics techniques, including external data from other entities, predictive
analytics and machine learning technologies. Its main characteristic will be its
ability to make comparisons between clients with similar profiles, launch custom
offers for every client, and predict and alert end-users on future activities.


Expected Outcomes 


* Patterns Detection Engine (including fraud).
* Recommender Engine.
* A Mobile App as UI for customer interaction.


Figure 3.23 summarizes the pilot workflow, where inputs include customer
profiles, customer transactions, LIB transactions and open data sets. After
applying the functionalities, the outputs include recommendations/customer
offers, alerts and chatbot/intelligent interaction.


Datasets 


* Customers & Retail Customer
* LIB clients transactions
* Selected assets
* Thousands of Profile Data/Photos


[Figure 3.23 content. Inputs: customer profiles, customer transactions, LIB
transactions, open data sets. Functionalities: identification of frequently
repeated transactions, identification of similar users, estimated costs,
prediction of cash flow issues, anomalous banking movements, anomalous spending
behavior. Outputs: recommendations/custom offers, alerts.]

Figure 3.23. Pilot #5A: smart and personalized pocket assistant for personal
financial management.


Pilot #5B: Business Financial Management (BFM) Tools
Delivering a Smart Business Advice


Most of today's Financial Management tools for Small and Medium Enterprises
(SMEs) are geared towards analysing only past transactions, making such tools
inadequate in today's world. Today, SMEs and their customers alike demand
just-in-time processing, transparency and personalized services that assist SME
owners not only in better understanding their business/financial health but also
in deciding on the next best action to take. Thus, Pilot #5B aims to assist SME
clients of Bank of Cyprus (BOC) in managing their financial health in the areas
of cash flow management, continuous spending/cost analysis, budgeting, revenue
review and VAT provisioning, by providing a set of AI-powered Business Financial
Management tools and harnessing available data to generate personalized business
insights and recommendations. Machine learning algorithms, predictive analytics
and AI-based interfaces will be utilized to develop a kind of smart virtual
advisor that aims to minimize SME business admin effort, help SMEs focus on
growth opportunities and optimize cash flow performance.

Main stakeholders of the pilot development include Bank of Cyprus (BOC) and 
University of Piraeus Research Centre (UPRC). BOC is providing a variety of data 
mainly regarding its SME clients and their respective transactions, while also being 
the key driver in designing the Business Financial Management toolkit, which will 
generate valuable insights and add value to the existing online services for SME 
beneficiaries. UPRC is working closely with BOC in designing all provided services. 
It is responsible for the development of all required ML/DL algorithms of the pilot
and the technical support of the pilot's implementation throughout the project.

The pilot aggregates a variety of data related to SME accounts from Bank of
Cyprus’ operational data warehouse, which include: (i) account, (ii) customer and
(iii) transaction data. Moreover, (iv) open banking data will be utilized to provide
a holistic approach, as well as (v) invoice data from the SMEs in order to provide
accurate reconciliation services.


Technological components and Services 


All services developed focus on providing valuable business insights and
recommendations to the SMEs, empowering them to effectively monitor cash flow,
budgeting and revenue and to perform reconciliation activities, all leading to
improved business management and data-driven decision making. The services
provided are depicted in Figure 3.24.

The figure shows a set of different services/components/engines, each at a
different development stage.

An early version of the Transaction Categorization Engine, which is considered 
a key component, has been developed. This component is in charge of labelling 
the transactions of selected SME customers of Bank of Cyprus into 20 main cat- 
egories (with around 80 respective subcategories to be implemented soon). This 
first version has been implemented combining rule-based classification and ML 
algorithms. 
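The combination of rule-based classification and a learned model can be sketched as follows. This is a simplified illustration, not the pilot's engine: the keyword rules, the category names and the token-voting classifier (a deliberately tiny stand-in for a real ML model) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical keyword rules; the pilot's real rules and categories are not given in the text.
RULES = {"payroll": "Salaries", "vat": "Taxes", "electricity": "Utilities"}


def rule_based_category(description):
    """Return a category if a rule keyword appears as a whole token, else None."""
    tokens = set(description.lower().split())
    for keyword, category in RULES.items():
        if keyword in tokens:
            return category
    return None


class TokenVoteClassifier:
    """Tiny stand-in for an ML model: each known token votes for its past labels."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)

    def fit(self, descriptions, labels):
        for description, label in zip(descriptions, labels):
            for token in description.lower().split():
                self.token_counts[token][label] += 1
        return self

    def predict(self, description):
        votes = Counter()
        for token in description.lower().split():
            votes.update(self.token_counts.get(token, {}))
        return votes.most_common(1)[0][0] if votes else "Uncategorised"


def categorise(description, model):
    """Rules take precedence; the learned model handles everything else."""
    return rule_based_category(description) or model.predict(description)


# Hypothetical usage with a handful of labelled descriptions.
model = TokenVoteClassifier().fit(
    ["office rent march", "fuel station payment", "office rent april"],
    ["Rent", "Transport", "Rent"],
)
label = categorise("rent for office", model)
```

Letting the rules win keeps well-understood cases deterministic and auditable, while the model covers descriptions the rules do not anticipate.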

The development of the Cash Flow Prediction component has also been initi- 
ated, exploring a variety of ML models to predict the expenses of certain categories 
of a given account in a short period of time. 
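As a point of reference for such models, a naive baseline can forecast the next period's expenses per category as the average of the most recent periods. The sketch below is only that baseline, not one of the ML models explored in the pilot; the tuple layout and the three-period window are assumptions of the example.

```python
from collections import defaultdict


def forecast_next_period(transactions, window=3):
    """Forecast next-period expenses per category as the mean of recent periods.

    `transactions` is an iterable of (period, category, amount) tuples, where
    `period` is any sortable key such as "2019-07". Periods in which a category
    has no transactions are skipped rather than counted as zero.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for period, category, amount in transactions:
        totals[category][period] += amount
    forecast = {}
    for category, by_period in totals.items():
        # keep only the latest `window` periods observed for this category
        recent = [by_period[p] for p in sorted(by_period)[-window:]]
        forecast[category] = sum(recent) / len(recent)
    return forecast
```

Any candidate ML model should at least outperform a baseline of this kind on held-out periods before it is worth its additional complexity.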

Development of these two components has already started, and they will be
included in the first PoC. The development process will add new components:


* Budget Prediction engine that makes it easy to set budget targets through the
provision of suggested target values, as well as simple budget monitoring.

* KPI engine leading to valuable insights on the SME's financial health and
performance.

* Transaction monitoring engine that watches out for potential anomalies and
savings.

* Invoice Processing engine that generates meaningful invoice background info
for other components (e.g. Cash Flow Prediction) and SMEs. This applies if the
respective data can be obtained from the SME's ERP system.

* Benchmark engine supporting comparisons to other SMEs with similar profiles.

* Recommender engine generating actionable insights that will allow an SME
to perform better.
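To illustrate what watching out for potential anomalies can mean in practice, the sketch below flags unusual transaction amounts using the modified z-score, a robust outlier rule based on the median and the median absolute deviation. The choice of this rule and of the conventional 3.5 threshold is an assumption of the example; the text does not specify which technique the Transaction monitoring engine will use.

```python
import statistics


def flag_anomalies(amounts, threshold=3.5):
    """Return indices of amounts whose modified z-score exceeds `threshold`."""
    median = statistics.median(amounts)
    # median absolute deviation: a spread measure that outliers barely affect
    mad = statistics.median([abs(a - median) for a in amounts])
    if mad == 0:
        return []  # no spread to compare against
    return [i for i, a in enumerate(amounts)
            if 0.6745 * abs(a - median) / mad > threshold]
```

Unlike a mean-and-standard-deviation rule, this measure is not distorted by the very outliers it is trying to detect, which matters for small per-category samples.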
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Testbed 


Bank of Cyprus (BOC) is developing an AWS testbed, based on the technical
requirements and guidelines of the relevant partners, and tailored to the
pilot's unique components and the required data ingestion. As the testbed's
specifications have not yet been finalised and certain bank processes require
time, the pilot's first components will be hosted in GFT's AWS environment
until the bank's AWS ecosystem is available.


Other non-technical requirements 


The pilot's component that provides a competitive advantage over other available
BFM tools is considered to be the Smart Virtual Advisor. It makes extensive use
of supervised and unsupervised machine learning and takes into consideration the
output from all BFM tools to come up with a holistic view of the SME business
and the corresponding accurate business advice and reconciliation, all fostering
optimal day-to-day business operation. The main non-technical requirement to
achieve this will be solving all consent and data protection issues arising from
including such enterprise data.


Implementation of a first Proof of Concept 


The Proof of Concept aims to establish the foundation for the various smart
Business Financial Management (BFM) engines. To achieve this, the design,
development and implementation of a Transaction Categorization engine is
prioritized, as it plays a vital role in the development and interconnection of
all other components (Figure 3.25). To demonstrate the integration between the
various engines, a basic Cash Flow engine will also be implemented.

The pilot's testbed will be hosted by Bank of Cyprus, which is going to
provide an AWS environment for the pilot's various components and operation. As
the testbed development has not yet been completed, the PoC version represents a
static development approach, where data have been collected and preprocessed by
BOC and then sent to UPRC, where the Transaction Categorization and Cash Flow
Prediction components are developed locally at the university's premises in
an offline environment. Once development of the testbed is completed, the two
components will be migrated to the INFINITECH ecosystem and fine-tuned
accordingly.

Pseudonymized data have already been transferred to UPRC in .csv format to 
initiate the development of the two main components of the PoC version. Those 
data include: 


* Customer Data from BOC: Data regarding selected SME BOC clients that
will be the first.

* Account Data from BOC: Information regarding more than a thousand
accounts linked to the abovementioned selected SME clients.

* Transaction Data from BOC: Dataset with approximately the transactions
of the selected SME BOC clients over the last three years. The dataset is
considered the main source for developing the pilot's first two components.

Figure 3.25. Pilot #5B: business financial management (BFM) tools delivering a smart
business advice.


The rest of the pilot's data will be utilised to enrich and refine the
Categorization and Cash Flow Prediction components included in the PoC and will
also be crucial for the development of the rest of the components.


Expected Outcomes 


The BFM tools will drive the SME digital adoption rate and pave the way for
reduced credit risk, a lower volume of Non-Performing Loans (NPLs) and
much-needed improved and streamlined SME lending. The expected deliverables
are:


* AI-driven Transaction Categorization Engine.
* AI-driven Financial Business Advice (Insights) Engine.
* Chatbot for BFM (Crowdpolicy).


The following figure illustrates the general workflow of Pilot #5B. It includes 
data sources, decision support system processes, BFM outputs and personalized 
recommendations. 

Datasets 


The following data sources will be integrated and used in the pilot: 


* Transaction Data from BOC: a .csv file of around 500 MB containing 3.5 million
transactions between 2018 and 2019;

* Transaction Data from Open Banking (i.e. PSD2 data);

* Transaction Data from SMEs (optional); 

* Other Data (Market); 

* Other Data from SMEs (optional);

* Accounts Data from BOC: maps accounts with the transactions; 

* Accounts Data from Open Banking; 

* Customer Data from BOC: links customers to accounts; the available
NACE code is used in the transaction categorization model;

* Direct Input from SMEs (e.g. feedback loop for transaction categorization). 
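
The relationship between the three BOC datasets (customers own accounts, accounts
carry transactions, and the customer's NACE code feeds the categorization model)
can be illustrated with a minimal sketch. All column names and sample rows below
are hypothetical, since the actual layout of the BOC .csv extracts is not described
here.

```python
import csv
import io

# Tiny in-memory stand-ins for the three BOC extracts. All column names are
# hypothetical; the text does not describe the real file layout.
CUSTOMERS_CSV = "customer_token,nace_code\nc1,47.11\nc2,62.01\n"
ACCOUNTS_CSV = "account_token,customer_token\na1,c1\na2,c1\na3,c2\n"
TRANSACTIONS_CSV = (
    "txn_id,account_token,amount\n"
    "t1,a1,-120.50\n"
    "t2,a2,300.00\n"
    "t3,a3,-45.00\n"
)


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def link_transactions(customers, accounts, transactions):
    """Attach the owning customer and its NACE code to every transaction."""
    nace_by_customer = {c["customer_token"]: c["nace_code"] for c in customers}
    customer_by_account = {a["account_token"]: a["customer_token"] for a in accounts}
    linked = []
    for txn in transactions:
        customer = customer_by_account.get(txn["account_token"])
        if customer is None:
            continue  # transaction on an account outside the selected SMEs
        linked.append({
            "txn_id": txn["txn_id"],
            "amount": float(txn["amount"]),
            "customer_token": customer,
            "nace_code": nace_by_customer.get(customer),
        })
    return linked


linked = link_transactions(
    read_rows(CUSTOMERS_CSV), read_rows(ACCOUNTS_CSV), read_rows(TRANSACTIONS_CSV)
)
```

Each linked row then carries the NACE code of the owning SME, which a
categorization model could use as an additional feature.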


Data Produced 


The pilot will combine the abovementioned diverse datasets in order to produce
personalized business insights and recommendations for SME customers of BOC.
Output data, as listed below, will be generated by the various engines in relation
to cash flow predictions, budgeting, KPIs, benchmarks, and transaction monitoring
and categorization. The data will be stored in the common datastore and made
available to the end user (SME) via the INFINITECH Reference Architecture (IRA)
gateway (and the bank's middleware). To this end, a pilot-specific REST API will
be developed, with a different endpoint for each specific service. The output
data/endpoints include:


* A JSON containing the obtained insights and recommendations to be provided
to the respective SME.

* A JSON containing the obtained cash flow related data to be provided to the
SME directly or indirectly.

* A JSON containing the derived budget target for each category used by the
respective SME.

* A JSON containing results on the Financial Health and Performance matrix.

* A JSON containing results on abnormal transactions and suspicious
expenses.

* A JSON containing a matrix with invoice information and payment
prioritization.

* A JSON containing benchmarks that allow the SME to compare itself with similar
businesses.
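
As an illustration of what one of these outputs might look like, the following
minimal sketch serializes per-category budget targets into a JSON document. The
field names (`sme`, `budget_targets`, `category`, `target`) and the sample values
are assumptions made for the example; the pilot's actual API schema is not
specified here.

```python
import json


def build_budget_payload(sme_token, targets):
    """Serialize per-category budget targets into one JSON document."""
    payload = {
        "sme": sme_token,  # pseudonymized SME identifier (hypothetical field)
        "budget_targets": [
            {"category": category, "target": round(amount, 2)}
            for category, amount in sorted(targets.items())
        ],
    }
    return json.dumps(payload, indent=2)


# Hypothetical categories and amounts, for illustration only.
example = build_budget_payload("tok_sme_001", {"Utilities": 450.0, "Rent": 1200.0})
```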


Explainable Workflow 


Some of the available datasets require real-time data collection, while for others
historical data collection is sufficient to provide actionable business insights. In
detail, transaction and account data related to the respective SME will be drawn
from BOC's repository by a real-time/historical data collector, while transaction
and account data from Open Banking (PSD2), as well as BOC customer data, will
use a historical data collector. Furthermore, a way of handling batches of data is
needed, as the bank should have the option of pushing data to the Infinistore once
a day (e.g. in cases where the real-time connection is lost or for the purpose of
uploading historical data). To this end, the bank's IT team will be able to upload
a batch of data in CSV format directly to the pilot-specific cloud sandbox. In
addition, an external data collector will be used to integrate other related Open
Banking/macroeconomic data. The use of the SME's own data sources (e.g.
ERP/accounting system) remains optional, as consent is required for the collection
and processing of such data and the data must be available in the cloud.

All data except external macroeconomic data will be pseudonymized (by
tokenization) before being uploaded to the IRA. The cloud Data Repository (within
the IRA) will then store all collected data, along with the generated insights, past
SME financial actions (to measure to what degree the SME's actions reflect the
recommended insights) and the minimum user input required. Continuous data
streaming will connect the Data Repository with the various deployed BFM tools
(machine learning algorithms), allowing the retraining of the respective AI models
and the generation of useful insights and recommended actions.

Reverse pseudonymization will then be applied before the processed data move to
the bank middleware component, which contains composite APIs and produces push
notifications, all of which will be offered to the SMEs via Android, iOS and web
apps. When an SME user logs in, the IRA is also accessed, and the
insights/recommendations are retrieved from the cloud data repository and provided
to the user. To this end, a prototype component will be developed to digest and
properly present the results of the corresponding analytics components. The pilot's
workflow is depicted in Figure 3.26.
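
Pseudonymization by tokenization and its reversal can be sketched as follows. This
is a minimal illustration under stated assumptions, not the pilot's actual
mechanism: the `TokenVault` class, the `SENSITIVE_FIELDS` tuple and the record
layout are all hypothetical, and a real vault would be a secured, persistent
service kept at the bank's premises rather than an in-memory dictionary.

```python
import secrets


class TokenVault:
    """Keeps the token <-> value mapping; it would never leave the bank."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        """Return a stable random token for a sensitive value."""
        token = self._token_by_value.get(value)
        if token is None:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return token

    def detokenize(self, token):
        """Reverse the pseudonymization for a previously issued token."""
        return self._value_by_token[token]


# Hypothetical names of the fields that identify an SME or its accounts.
SENSITIVE_FIELDS = ("customer_id", "iban")


def pseudonymize(record, vault):
    """Replace sensitive fields with tokens before upload to the cloud."""
    return {k: (vault.tokenize(v) if k in SENSITIVE_FIELDS else v)
            for k, v in record.items()}


def reverse_pseudonymize(record, vault):
    """Restore the original values before presenting data to the SME."""
    return {k: (vault.detokenize(v) if k in SENSITIVE_FIELDS else v)
            for k, v in record.items()}


vault = TokenVault()
original = {"customer_id": "SME-000123", "iban": "DEMO-IBAN-1", "amount": 99.5}
uploaded = pseudonymize(original, vault)
restored = reverse_pseudonymize(uploaded, vault)
```

Because the same value always maps to the same token, records about one SME
remain linkable inside the cloud repository without exposing the underlying
identifier.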


Logical Schema 


The following figure illustrates a logical view of the pilot system architecture in
line with the INFINITECH-RA.

The datasets used, as well as the pilot's RA, are illustrated in Figure 3.27. All
personal and sensitive data related to SME customers of BOC will be pseudonymized
at the bank's premises using a tokenization approach before being streamed to the
INFINITECH ecosystem, to ensure the protection of vital SME data. Reverse
pseudonymization will be applied before presenting the data to the SME end user.
The RA of the pilot, as included in D2.13, is depicted below:

The various components will be containerized using Docker, and a LeanXcale
database will be used to store and query the results of the analytics processing, as
well as the insights generated by the recommender engine. Most of the data
analytics components are developed using Python data analytics and ML/DL
libraries, i.e. NumPy, pandas, scikit-learn and TensorFlow, while the data streams
required by some components for real-time analytics will be handled with Apache
Kafka. For the time being, a static approach has been followed and all development
has been carried out offline at the University of Piraeus premises; all progress
will be migrated to the INFINITECH ecosystem once the pilot's AWS testbed is set
up.

Figure 3.26. Business financial management (BFM) pilot workflow.


Components 
The following components will be deployed and used in the pilot pipelines: 


* Transaction Categorization Engine (Analytics layer in the RA): key component
in charge of labelling the transactions of selected SME customers of Bank of
Cyprus into 20 main categories (with around 80 respective subcategories to be
implemented soon);

* Cash Flow Prediction component (Analytics layer in the RA): based on a
probabilistic Deep Neural Network (an implementation of the DeepAR model) to
predict the expenses of certain categories of a given account over a time
horizon of 12 weeks;

* Budget Prediction engine (Analytics layer in the RA): makes it easy to set
budget targets by suggesting target values, and offers simple budget
monitoring;

* KPI engine (Analytics layer in the RA): provides valuable insights into the
SME's financial health and performance;

* Transaction monitoring engine (Analytics layer in the RA): watches out for
potential anomalies and savings; to this end, graph analysis approaches are
being explored and implemented;

* Invoice Processing engine (Analytics layer in the RA): generates meaningful
invoice background information for other components (e.g. Cash Flow
Prediction) and SMEs. This applies if the respective data can be obtained
from the SME's ERP system;

* Benchmark engine (Analytics layer in the RA): supports comparisons with
other SMEs with similar profiles;

* Smart Advisor (Analytics layer in the RA): generates actionable insights that
will allow an SME to perform better.
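
To make the input and output of the Cash Flow Prediction component more concrete,
the sketch below aggregates categorized transactions into weekly expense totals and
produces a 12-week forecast. Note the loud assumption: the forecast here is a naive
historical-mean baseline used purely as a stand-in, not the DeepAR model that the
pilot implements, and the tuple layout `(date, category, amount)` with negative
amounts for expenses is hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

HORIZON_WEEKS = 12  # forecast horizon stated for the component


def weekly_expenses(transactions):
    """Sum expenses (negative amounts) per (category, ISO year, ISO week).

    Weeks without any expense are simply absent from the result.
    """
    totals = defaultdict(float)
    for day, category, amount in transactions:
        if amount < 0:
            iso = day.isocalendar()
            totals[(category, iso[0], iso[1])] += -amount
    return totals


def naive_forecast(totals, category, horizon=HORIZON_WEEKS):
    """Baseline stand-in for DeepAR: repeat the mean of the observed weeks."""
    history = [value for (cat, _, _), value in sorted(totals.items())
               if cat == category]
    if not history:
        return []
    return [mean(history)] * horizon


transactions = [
    (date(2019, 1, 7), "Rent", -100.0),
    (date(2019, 1, 8), "Rent", -50.0),     # same ISO week as the line above
    (date(2019, 1, 14), "Rent", -50.0),    # following ISO week
    (date(2019, 1, 15), "Sales", 900.0),   # income, ignored by the expense sum
]
totals = weekly_expenses(transactions)
forecast = naive_forecast(totals, "Rent")
```

A probabilistic model such as DeepAR would replace `naive_forecast` and return a
distribution per week rather than a single value, but it would consume the same
kind of per-category weekly series.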


Conclusions - Issues and Barriers 


In conclusion, the pilot's development is progressing according to the project's
timeline, having already established the Transaction Categorization and Cash Flow
Prediction components, which are considered the foundation for designing and
developing the rest of the AI-powered components of the BFM toolkit that will be
delivered to SMEs. The pilot's next milestone is moving all development work to
the cloud environment and setting up the required data streaming/data collection
mechanisms. The main challenge is the development and efficiency of the AI-powered
Business Financial Management tools, as these depend on the availability of all
the required SME data, from BOC or from the SMEs themselves, for the final goal to
be achieved.

The main goal of the pilot is to assist SME clients of Bank of Cyprus (BOC) in
managing their financial health in the areas of cash flow management, continuous
spending/cost analysis, budgeting, revenue review and VAT provisioning, by
providing a set of AI-powered Business Financial Management tools and harnessing
available data to generate personalized business insights and recommendations.
Machine learning algorithms, predictive analytics and AI-based interfaces will be
used to develop a kind of smart virtual advisor, with the aim of minimizing SME
business administration effort, focusing on growth opportunities and optimizing
cash flow performance.


Pilot #6: Personalized and Intelligent Investment Portfolio 
Management for Retail Customer 


The goal of this pilot is to create a system for personalized investment
recommendations for the retail customers of the bank. National Bank of Greece
(NBG) will leverage large customer datasets and large volumes of customer-related
alternative data sources (e.g., social media, news feeds, on-line information) in
order to make the process of providing investment recommendations to retail
customers more targeted, automated and effective, as well as context-aware (i.e.
tailored to the state of the market). Context-awareness is the main innovation of
the pilot. An overview of Pilot #6 is given in Figure 3.28.

Figure 3.28. Pilot #6 personalized closed-loop investment portfolio management for
retail customers.


Technological components and Services 


Going a step beyond the pilot's RA towards the functional overview shown in
Figure 3.29, the High-Level Architecture presents the software components that
build the pilot's use cases. This figure has been used to identify hardware
requirements and describes the technologies behind its principal components. This
document links the software components shown with the corresponding RA layers,
providing some details about their implementation. In this sense:


* NBG supplies the raw datasets required for the implementation of the final
services. The pilot has already identified the relevant portion of customer
data that will be used, based on the existing Data Warehouse (DWH).

* Data Collection and Data Normalization components (Data Management, Data
Protection and Data Processing layers of the RA): based on Icarus from UBI,
these define the rules to process, harmonize, cleanse and anonymize data from
NBG and insert them into a datastore provided by LXS.

* Customer Risk Profile Cluster: implemented by an ML/DL algorithm developed
by NBG that classifies customers into 4 risk profiles: Conservative, Income
Seeking, Balanced and Growth Seeking. The algorithm is applied to both
investors (having answered the MiFID questionnaire) and non-investors.

* Personalized Investment Recommendation AI engine: it will produce the
recommended instruments for investment, also utilizing sentiment analysis
data from RB.

* Customer initiation and personalized recommendation: obtained through a
visualization application developed by CP. This application will also
orchestrate the processes of analysis, initiation, execution and processing
when a new customer or new data become available.

Figure 3.30. Pilot #6 hardware/software requirements for the testbed.


Testbed 


The final deployment of Pilot #6 relies on the MS Azure cloud infrastructure that
NBG will provide. Further details of the first software/hardware analysis and its
results can be found in that document, but are summarised in Figure 3.30:

The NBG MS Azure infrastructure is currently being deployed by the NBG IT team
to accommodate the first Proof of Concept under development; based on the pilot's
development progress, it will be adjusted to accommodate any additional resource
requirements.


Other non-technical requirements 


Besides the technical requirements that make up the core pilot platform, the
deployment of the AI technologies and the data collection, we have not identified
any non-technical requirements that may affect the pilot's outcomes.

Figure 3.31. Pilot #6 Data collection PoC architecture.


Implementation of a first Proof of Concept 


The first Pilot #6 demonstrator (PoC) focuses on processing a subset of Cards and
Deposit Accounts Transaction Data extracted from the bank's Operational DWH.
Figure 3.29 presents the functional diagram of the developed PoC.

Based on the raw datasets available from NBG for retail customers, a first version
of the ML/DL algorithm to be implemented by NBG will provide the Customer Risk
Profile clustering. The Risk Profile will be one of the inputs feeding the final
core component, the Personalized Investment Recommendation AI engine. The AI
engine is not available in the first PoC.

Based on the algorithm's results for Customer Risk Profile clustering, a web page
will be provided as a dashboard for visualizing the results, as a way of giving
first demos of the PoC (Figure 3.31).

The execution of the Proof of Concept will provide valuable feedback on the AI
approach that will work best for the pilot, as well as create the common ground
for the future development required for the full scope of the pilot to be
realised.


Expected Outcomes 


* BigData/AI system for personalized investment recommendations for the 
retail customers of the bank. 

* Development of a closed-loop system that continuously learns, improves itself
and provides better recommendations.

* The system shall improve the productivity of the bank's investment
consultants by enabling faster access to recommendations tailored to their
retail customers' needs.

Figure 3.32. Pilot #6: personalized and intelligent investment portfolio management for
retail customer workflow.

Figure 3.33. Personalized closed-loop investment portfolio management pilot pipeline
in-line with IRA.


Figure 3.33 depicts the Pilot #6 workflow for personalized and intelligent
investment portfolio management for retail customers.


Datasets 


The data used for this pilot will be extracted from the NBG data warehouse and
several other data sources and anonymized in CSV files:


* Deposit Account Transactions: Data of Deposits accounts transactions for 
retail customers are extracted for the last two (2) years (8,91). 

* Cards Transactions: Data of Transactions related to Cards for retail customers 
for the last two (2) years (7,3GB). 

* Instruments Historical Prices: Data for Instruments Historical Prices for the 
last two (2) years (0,23GB). 

* Investment Related Transactions: Data of Investment Related Transactions
for the last two (2) years (0,3GB).

* Instruments Characteristics: Data on instrument characteristics for matching
with customers' profiles, including asset class, currency, ISIN, maturity, etc.
(0,01GB).

* CRM Data: Data related to 150.000 customers, such as demographics, product
ownership and responses to MiFID questionnaires (0,05GB).

* Sentiment Analysis: for each instrument that the data analysis proposes as a
recommendation, sentiment derived from RB information from the news and/or
social media, to provide NBG customers with clearer, real-time risk results.


Data Produced 


The pilot will produce personalized investment recommendations for the retail
customers of the bank, based on their risk and transaction profiles. Based on each
customer's risk and transaction profile, the bank's relationship managers will be
able to propose the financial instruments in which the customer is likely to be
interested in investing, together with their relative prioritization. The proposed
recommendations will be based on the instruments available from the bank,
together with sentiment analysis input for each financial instrument, derived from
news and social feeds for the specific instrument (e.g. stock, bond, etc.) or the
relevant instrument category.

The existing landscape in financial institutions, and particularly banks, has made
the identification of targeted customer propositions a priority, especially in the
investments sector. Driven both by competition and by customer needs, depicting
each customer's potential and risk appetite, in combination with recommendations
of interest to the customer, may increase each customer's share of wallet and
ultimately the bank's market share in the specific sector.


Explainable Workflow 


Data from the NBG data warehouse related to retail clients of investment products
(CRM Data, Deposit Account Transactions, Cards Transactions, Investment Related
Transactions) will be extracted in CSV files and, using the relevant tools for data
processing, anonymization, quality checking and cleansing, imported into the
LeanXcale datastore. Based on the data extracted for NBG clients, the Customer
Risk Profile engine, using data analysis tools, will divide all customers into
specific profile clusters based on their investment and banking behaviour (MiFID
questionnaires), deposit and card transactions, as well as investment transactions.
NBG will also provide, for each customer cluster profile, the instruments that
are suitable for investment.
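
How a risk profile, the per-profile list of suitable instruments and a sentiment
score could combine into a prioritized recommendation is sketched below. This is
only an illustrative filter-and-rank rule under stated assumptions, not the pilot's
AI engine: the `suitable_for` and `sentiment` fields, the instrument names and the
scores are all hypothetical.

```python
# The four risk profiles named in the text.
RISK_PROFILES = ("Conservative", "Income Seeking", "Balanced", "Growth Seeking")


def recommend(profile, instruments, top_n=3):
    """Keep instruments suitable for the profile, ranked by sentiment score."""
    if profile not in RISK_PROFILES:
        raise ValueError(f"unknown risk profile: {profile}")
    suitable = [i for i in instruments if profile in i["suitable_for"]]
    ranked = sorted(suitable, key=lambda i: i["sentiment"], reverse=True)
    return [i["name"] for i in ranked[:top_n]]


# Hypothetical instrument catalogue with made-up sentiment scores in [-1, 1].
instruments = [
    {"name": "Bond A", "suitable_for": {"Conservative", "Income Seeking"},
     "sentiment": 0.2},
    {"name": "Fund B", "suitable_for": {"Balanced", "Growth Seeking"},
     "sentiment": 0.6},
    {"name": "Stock C", "suitable_for": {"Growth Seeking"}, "sentiment": 0.9},
    {"name": "Bond D", "suitable_for": {"Conservative"}, "sentiment": 0.5},
]

conservative_picks = recommend("Conservative", instruments)
growth_picks = recommend("Growth Seeking", instruments)
```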


Logical Schema 


A first approach to mapping the pilot architecture to the INFINITECH-RA layers
and pipelines is illustrated in the following figure.


Components


The following components will be deployed and used in the pilot: 


* DataStore (LeanXcale) (Data Sources in RA).

* NBG Datasets (Data Sources in RA).

* Data Collection (UBI Icarus) (Data Management in RA).

* Data Normalization (UBI Icarus) (Security in RA).

* Customer Risk Profile Cluster (Analytics in RA): classifies customers into 4
risk profiles: Conservative, Income Seeking, Balanced and Growth Seeking.

* Personalized Investment Recommendation AI engine (Analytics in RA).

* Customer initiation and personalized recommendation UI Application
(Presentation in RA).


Conclusions - Issues and Barriers 


Based on the work done so far, the pilot appears to be on track, following the
implementation plan already agreed with all the contributing partners and using
the technology components that are already available, or will become available, as
part of the INFINITECH project.

The main foreseen challenges include:


* Implementation of the best-performing ML/DL algorithms; for this purpose,
we have started to evaluate some of the algorithms already available from
INFINITECH partner University of Glasgow (GLA).

* Setup of the testbed based on the blueprint reference architecture (to be
hosted on MS Azure).

* Calibration of the ML/DL algorithms to provide the best results for
investment recommendations.


As an outcome, the bank will develop a better and more trusting relationship with
its customer base, who will hopefully come to choose the bank exclusively for the
entire spectrum of financial advice, products and services. The bank will also
increase its trading volumes, and the investment consultants will see their
productivity improve.


Pilot #15 Open Inter-banking 


The main objective of Pilot #15 is to deliver a prototype that addresses business
pains shared among banking institutions by leveraging Machine Learning and
Natural Language Understanding paradigms. The model aims to read and analyze
extensive internal bank documents in real time, to highlight the main concepts
and compare them with a reference taxonomy to build a common business glossary,
in order to:


* Provide banks with a tool able to standardise the documentation analysed; 

* Increase Automation and Intelligence based on data processing leveraging 
data governance processes; 

* Improve the analysis and comprehension capabilities of internal documents 
and contents. 


The Inter-Banking Open pilot, as its name suggests, is the result of an open call
on business pains shared among several banks, and its objective is to develop a
solution that could address such pains in a pre-competitive environment. Due to
its composition, the pilot is strongly market-driven and aims to implement the
prototype of a solution based on Machine Learning and Natural Language
Understanding paradigms.

This prototype will start from the analysis of a subset of process operating
documents and attempt to classify the information they contain with respect to
the ABI Lab taxonomy, which is used by Italian banks to build their business
glossary and, in general, to support Enterprise Architecture Modelling.
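
The idea of classifying a document passage against a reference taxonomy can be
sketched with a deliberately simple lexical method. The assumptions are stated
loudly: bag-of-words cosine similarity is used here as a stand-in for the Natural
Language Understanding models the pilot targets, and the two taxonomy entries
below are invented for the example and are not taken from the ABI Lab taxonomy.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case bag of words for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def classify(passage, taxonomy):
    """Return (best matching label or None, similarity score)."""
    vec = tokens(passage)
    scored = [(cosine(vec, tokens(description)), label)
              for label, description in taxonomy.items()]
    score, label = max(scored)
    return (label if score > 0 else None), score


# Hypothetical taxonomy entries: label -> short description.
taxonomy = {
    "KYC": "customer identity verification documents",
    "MiFID": "investment suitability questionnaire financial instruments",
}

label, score = classify(
    "The customer must provide identity documents for verification", taxonomy
)
```

A production system would replace the bag-of-words vectors with semantic
representations, so that passages using different wording for the same concept
still map to the same taxonomy node.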

ABI Lab is the Banking Research and Innovation Centre founded and promoted
by the Italian Banking Association (ABI). Through research and advocacy, ABI Lab
promotes innovation as a means of growth and reinforcement of the banking
system. To support digital transformation, ABI Lab has created the AI Hub, a
centre of excellence for discussing the application of AI in the banking and
financial sector.

Within the AI Hub, the objective of the pilot is to promote the development of
a common use case, which will involve different banks through a shared research
approach. The use case will be developed in two steps, as described in
Figure 3.34.

Figure 3.34. Pilot #15 steps and main objectives.


Technological components and Services


The main objective is to build an AI tool able to read the internal documents of a 
bank to highlight the main concepts and compare them with reference taxonomies 
to build a common business glossary. 

Technological components and services will be defined according to the pilot 
objectives. 


Testbed 


The technical and development aspects, in particular within the dedicated testbed,
will be supported by GFT and HPE. The pilot will be hosted and deployed on the
Testbed blueprint, which will be developed according to the pilot requirements.


Data Sources 


Data will be extracted from a large set of the banks' internal documents in PDF,
Word and/or TXT format, provided by the banks involved in the pilot. The documents
will focus on the following areas: KYC, entering into a relationship with the
customer and the Markets in Financial Instruments Directive (MiFID). In addition
to documents related to these three specific areas, other data sources include:


* Additional documents relating to different areas and identified randomly
within the document base;

* Internal dictionaries, internal glossaries and internal taxonomies useful for
the development of metadata techniques;

* The ABI Lab architectural framework (reference taxonomy).


Data Produced 


The advanced document processing will deliver useful information in real time by
searching for semantically relevant text according to the semantic metadata,
increasing automation, ease of use and the usefulness of outcomes.


Explainable Workflow 
STAGE 1 — study and research @ ABI Lab controlled environment 


* Data ingestion/preparation, including technical components aimed at nor- 
malising and aggregating the data that we need for our specific analytical 
purposes, preparing the information to be processed by the Machine Learn- 
ing tools; 

* Data storage, including tools and infrastructures aimed at data collection 
from different sources and in different formats, and their storage; 


* Machine learning engine optimisation, enabling continuous optimisation of the
Natural Language Understanding algorithms, in line with the experimental
purposes of the use case;

* Semantic model design;

* A data visualisation layer, including tools and methods to display results to
different users and stakeholders.


STAGE 2 - test and validation @ Infinitech testbed (Model based on BDVA RA) 


Logical Schema 


The following figure illustrates the logical architecture of the pilot in line
with the INFINITECH-RA constructs and approach. The pilot's Reference Architecture
(Figure 3.35) and main data flows are presented in detail. This RA can be outlined
in three main layers, to be implemented through different software components.
These three main layers are:


* A Data Management layer, which performs data quality checking and
harmonization of the data provided by NBG and imports them into the datastore
for use by the pilot's functionalities. In this first stage, the data used will
be transaction data of deposit accounts, cards and investments, and CRM data,
for a small subset of NBG customers.

* Data Protection and Data Processing layers, in charge of cleansing,
homogenising and storing all the data collected, according to specific data
models provided from the NBG operational DWH, so that they are available for
the analytics processes. These layers also include all the operations needed
to anonymise (if required) the captured data and protect this information from
unauthorised access.

* An Analytics block, fed by the data layers, where different ML/DL
technologies and visualization tools will enable data monitoring, analysis and
exploitation. Two main AI models will be developed here: the Customer Risk
Profile Clustering and the Personalized Investment Recommendation decision
support, which will also use the sentiment analysis data provided by the RB
engine, in order to present the recommended products in which a customer could
invest through a visualization application (in the RA's Visualization layer).


The main stakeholders for this pilot are the account officers of a bank, who will
be able to provide personalized investment recommendations for customers.
Recommendations are based on the customer's (risk) profile, as well as on the
relevant sentiment analysis data from the news, social media and other resources
on the internet. In Pilot #6 these stakeholders will be represented by the bank,
NBG (National Bank of Greece), which provides the user stories.

The roles of the partners in this pilot are as follows: NBG (as bank and business
owner) provides the customers' data; UBI (Ubitech) processes the data through the
Data Management and Data Processing layers and inserts them into the datastore
software provided by LXS (LeanXcale). The AI algorithms are developed by NBG,
utilizing sentiment analysis data from ReportBrain (RB). The University of Glasgow
is now also participating, enhancing the AI algorithms. Finally, an end-user
application developed by CP (Crowdpolicy) will show the desired information and
recommendations.

A high-level view of the functional architecture is described below: 


* A data storage layer, including tools and infrastructures aimed at data collec- 
tion from different sources and in different formats, and their storage. 

* A data ingestion/preparation layer, including technical components aimed
at normalising and aggregating the data needed for this specific analytical
purpose, preparing the information to be processed by the Machine Learning
tools.

* A machine learning engine layer, including Natural Language Understanding
algorithms, suitably configured for the purposes of the use case.


This pilot will allow the screening of extensive documentation in real time. This
will be a starting point for the optimization of solutions that every single bank
can adopt and adapt in its own context. The pilot will involve a community of
banks, which will:


* Provide datasets related to internal documentation.

* Provide information and address issues around the use of common taxonomies
or glossaries to build a classification and analysis model.

* Participate in requirement identification and service evaluation.


The development (and also the training plans for the AI models) will be driven 
by ABI Lab, supported by the members of the AI Hub community. 

The banks will be the final users, bearing in mind that the objective of the
pilot is to develop an experimental prototype that will be the subject of further
analysis by the participating stakeholders.


Components 


The main technological components that will be implemented and integrated as 
part of this pilot are: 


* A data storage layer, including tools and infrastructures aimed at data collec- 
tion from different sources and in different formats, and their storage; 

* A data ingestion/preparation layer, including technical components aimed at
normalising and aggregating the data that we need for our specific analytical
purposes, preparing the information to be processed by the Machine Learning
tools;

* A machine learning engine layer, including Natural Language Understanding
algorithms, suitably configured for the purposes of the use case.


Conclusions - Issues and Barriers 


This pilot will allow the screening of extensive documentation in real time. This
will be a starting point for the optimization of solutions that every single bank
can adopt and adapt in its own context. However, the scope of the pilot raises
some foreseeable challenges, mentioned below:


* Bringing together multiple banking, technical and academic stakeholders to
achieve shared objectives.

* Harmonization of semantic representational models in the context of
financial services.

* Exploitation of data assets.


3.3 Personalized Usage Based Insurance Products 


Pilot #11: Personalized Insurance Products Based on IoT
Connected Vehicles


In short, this pilot aims to develop new services for car insurance companies,
based on the information gathered from a connected vehicle treated as an IoT
ecosystem. Current car insurance services try to reward good drivers over bad
ones, but they rely on very static or historical information: the driver's age,
the colour of the car, incidents per year, etc. A more dynamic approach, with
adapted and customised services, is needed: you pay as you drive, much as in the
cloud world, where you pay as you consume. Complementary to this, a second
service will help to detect possible fraud situations. Fraud causes unfair costs
to the company, which indirectly affect good, honest drivers.

In both use cases the underlying technology is based on connected vehicles, IoT
and Big Data, because of the expected amount of data to be managed. The business
analysis part, which determines how good a driver is, and the detection of
possible fraud will be based on AI and ML techniques. Because personal data is
managed in the pilot, security and privacy will also be a technological
challenge.

This pilot focuses on car insurance and risk analysis by developing two
AI-powered services: Pay as you Drive, which allows the insurance company to
adapt prices by classifying drivers according to the way they drive; and Fraud
Detection, which helps to identify the actual driver of a vehicle involved in an
incident. These two services rely on a driving profiling tool that requires
datasets from connected vehicles to define, identify and train the different
profiles as ML models. Other external data sources, such as traffic incidents or
weather, will be used to classify the driver, contextualizing the assigned
driving profile. An overview of Pilot #11 is given in Figure 3.36.

Figure 3.36. Pilot #11 Personalized insurance products based on IoT connected
vehicles overview.


Technological components and Services 


Going a step beyond the pilot's RA towards the functional overview shown in
Figure 3.37, the High Level Architecture represents the software components that
build the pilot's use cases. This figure has been used to identify hardware
requirements and to describe the technologies behind its principal components.
This document links the software components shown with the corresponding RA
layers, providing some details about their implementation. In this sense:

IoT infrastructures supply raw datasets required for the implementation of the 
final services. The pilot has already identified and linked connected vehicles (real 
and simulated); weather stations (from AEMET); roads (from OpenStreetMap) 
and traffic alerts. 

Data Collection & Aggregation and Data Normalization components (Data
Management and Protection layers of the RA): based on NGSI-LD and FIWARE Data
Models, these define the rules to ingest data from IoT infrastructures. First
functional versions for the identified IoT sources are deployed and ingesting
data. Worth noting here is the work done to integrate the Simulation of Urban
Mobility (SUMO) tool with the pilot's framework, following the NGSI guidelines.
Also in this layer, Gradiant's Anonymizer tool analyses and anonymises (when
required) the collected data before it is uploaded.

Figure 3.37. Pilot #11 high level architecture.


Connected Car framework (Data Processing layer in RA): composed of the FIWARE
Orion Context Broker, which supports all context management functionalities
(context information broker), and an instance of the FIWARE QuantumLeap Generic
Enabler (context information persistence), which supports historical information
management. A first instance, covering these two components, has been
implemented and deployed in Atos' infrastructure.
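The context information handled by these components can be pictured with a
minimal Python sketch that maps one raw vehicle sample to an NGSI-LD style
entity. The entity id, the attribute names (`speed`, `location`), the helper
`to_ngsi_ld_vehicle` and the choice of JSON-LD context are assumptions made for
illustration; they are not the pilot's actual data adaptor or data model.

```python
import json
from datetime import datetime, timezone

# Assumed JSON-LD context; a real deployment would reference its own data models.
NGSI_LD_CONTEXT = "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"


def to_ngsi_ld_vehicle(vehicle_id, speed_kmh, lat, lon, observed_at):
    """Map one raw vehicle sample to an NGSI-LD style entity (hypothetical helper)."""
    timestamp = observed_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "id": f"urn:ngsi-ld:Vehicle:{vehicle_id}",
        "type": "Vehicle",
        "speed": {
            "type": "Property",
            "value": speed_kmh,
            "unitCode": "KMH",
            "observedAt": timestamp,
        },
        "location": {
            "type": "GeoProperty",
            # GeoJSON order is [longitude, latitude].
            "value": {"type": "Point", "coordinates": [lon, lat]},
            "observedAt": timestamp,
        },
        "@context": [NGSI_LD_CONTEXT],
    }


# Illustrative sample, not pilot data.
sample_time = datetime(2021, 2, 1, 8, 30, 0, tzinfo=timezone.utc)
entity = to_ngsi_ld_vehicle("veh-001", 47.5, 42.2406, -8.7207, sample_time)
print(json.dumps(entity, indent=2))
```

An entity of this shape is what a data adaptor would hand to a context broker,
which keeps the latest value of each attribute, while a persistence component
would retain the time-stamped history.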

EASIER-AI component (RA Analytics layer): a hybrid (Cloud/Edge) framework,
currently under development, that makes it easier to develop, measure, monitor
and deploy customised AI models. It is built on top of the Elasticsearch, Kibana
and TensorFlow stack and enables the deployment of different ML/DL technologies.
On top of this, Pilot #11 is developing (and will train) the Driving Profiling
and Driver Classification inferencers (User Interaction RA layer) that will
support the "Pay as You Drive" and "Fraud Detection" services (Visualization RA
layer).

Access to these frameworks (Connected Car and EASIER-AI) is protected by an
OAuth identification and authentication component that relies on the FIWARE
KeyRock IdM. SSL/TLS is used to protect communications. This component is
deployed and integrated with the Connected Car framework.


Testbed 


Pilot #11's final deployment relies on the UNINOVA infrastructure. Further
details of the first software/hardware analysis and its results can be found in
Figure 3.34.

The UNINOVA infrastructure is currently being dimensioned to support several
clusters, so it is not yet available for deployments. The Pilot #11 demonstrator
is therefore being deployed on ATOS premises. All first versions of the Pilot
#11 components are deployed using a combination of Docker and Kubernetes, to
ease migration to the final testbed location.


Other non-technical requirements


Besides the technical requirements that make up the pilot's core platform, the
deployment of AI technologies and the data collection, an additional and
relevant requirement has been identified to obtain the best outcomes. It is
related to the availability of enough data sources (Figure 3.38).

Anonymous connected vehicles will provide the routes (and vehicle data) needed
to define, train and evolve the different AI models (and ML/DL technologies).
The more vehicles enrolled, the better the models obtained; likewise, the more
vehicles reporting around the same area, the better the traffic models that can
be created and, therefore, the better the driver classifications that can be
performed. In this line, the pilot will have 20 connected vehicles, each fitted
with a smart on-board unit that captures data from the vehicle's CAN bus
(technical vehicle data such as speed, acceleration, systems status, etc.) plus
an NMEA unit to capture the vehicle's GPS location. These vehicles will start
driving in February 2021, supported by the CTAG infrastructure, and they are
planned to report connected vehicle datasets for 4 hours a day for at least one
year.

Figure 3.38. Pilot #11 hardware testbed.
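To make the role of the NMEA unit more concrete, the following minimal Python
sketch parses a standard NMEA 0183 RMC sentence into position, speed and
heading. The sample sentence and the helper names are illustrative assumptions;
the actual on-board unit and its output format are not described in this
section.

```python
from functools import reduce


def nmea_checksum(body):
    """XOR of all characters between '$' and '*', as defined by NMEA 0183."""
    return reduce(lambda acc, ch: acc ^ ord(ch), body, 0)


def to_decimal_degrees(value, hemisphere):
    """Convert NMEA (d)ddmm.mmmm notation to signed decimal degrees."""
    dot = value.index(".")
    degrees = int(value[: dot - 2])
    minutes = float(value[dot - 2:])
    decimal = degrees + minutes / 60.0
    return -decimal if hemisphere in ("S", "W") else decimal


def parse_rmc(sentence):
    """Parse an RMC sentence into a small dict; raises ValueError if invalid."""
    if not sentence.startswith("$") or "*" not in sentence:
        raise ValueError("not an NMEA sentence")
    body, checksum = sentence[1:].strip().split("*", 1)
    if nmea_checksum(body) != int(checksum, 16):
        raise ValueError("checksum mismatch")
    fields = body.split(",")
    if not fields[0].endswith("RMC") or fields[2] != "A":
        raise ValueError("not a valid RMC fix")
    return {
        "lat": to_decimal_degrees(fields[3], fields[4]),
        "lon": to_decimal_degrees(fields[5], fields[6]),
        "speed_kmh": float(fields[7]) * 1.852,  # knots to km/h
        "heading_deg": float(fields[8]),
    }


# Illustrative sentence (not pilot data); the checksum is computed, not hard-coded.
body = "GPRMC,083000,A,4214.436,N,00843.242,W,25.0,90.0,010221,,"
sentence = f"${body}*{nmea_checksum(body):02X}"
fix = parse_rmc(sentence)
print(fix)
```

Records of this kind would complement the CAN bus signals with a location and
timestamp for each sample.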


Implementation of a first Proof of Concept 


The first Pilot #11 demonstrator is focused on the data collection and
homogenisation process, in order to identify any potential issue (or required
dataset) that may affect the subsequent steps of the pilot. It will also provide
fundamental elements for the AI modelling stage. Figure 3.39 presents the
functional diagram of the developed PoC.

As the PoC is centred on data gathering, the ATOS Connected Car framework will
be the core component to test and evolve. As mentioned above, this set of
components is mostly deployed in the Atos infrastructure, with support from CTAG
to build and deploy their own data adaptors for their vehicles. In this sense:


* First versions of the data adaptors (based on NGSI and FIWARE Data Models) are
deployed, ingesting data from the depicted data sources.

* The Connected Car core framework (Context Broker and Historical Repository
based on FIWARE) is also ready, managing the ingested context information. An
NGSI-LD REST API is ready to access the collected data.

* The Identification and Authentication layer, based on FIWARE KeyRock IdM,
is, in turn, managing OAuth tokens to grant access to the framework.
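As a rough illustration of how a client might combine the NGSI-LD REST API with
an OAuth token, the sketch below prepares (but does not send) a query for
vehicle entities. The host name, the token value and the use of a bearer
`Authorization` header are assumptions for the example; the pilot's actual
endpoint and token handling may differ.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and token; the pilot's real host and credentials differ.
BROKER_URL = "https://broker.example.org"
ACCESS_TOKEN = "replace-with-oauth-access-token"


def build_entities_request(entity_type, limit=10):
    """Prepare (but do not send) a GET request for NGSI-LD entities of one type."""
    query = urlencode({"type": entity_type, "limit": limit})
    return Request(
        url=f"{BROKER_URL}/ngsi-ld/v1/entities?{query}",
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Accept": "application/ld+json",
        },
        method="GET",
    )


request = build_entities_request("Vehicle")
print(request.get_method(), request.full_url)
# Sending it would be urllib.request.urlopen(request), which needs a reachable broker.
```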


With all these components up and running, some dashboards are being developed in
order to present the collected data and to start the data analytics processes
(Figure 3.40). These will help to identify the best AI approach for the pilot's
final services.


Expected Outcomes 


* Provide personalized insurance plans: Pay as you Drive/Usage-based insur- 
ance. 

* Collect additional data about the status of the connected vehicle and the
reaction of the driver (e.g. in case of an accident), improving fraud detection
capabilities.

* Provide a more effective and dynamic billing system. 


The following figure illustrates the schematic overview of the Pilot #11 workflow. 


Datasets 


The main data source in the pilot is the connected vehicle, with about 20
vehicles taking part. The inclusion of some historical data provided by vehicles
from previous projects is under study, if legally possible. The data produced by
the connected vehicles includes CAN data, traffic events, GPS, speed, etc.,
complemented with data provided by the city of Vigo.

Figure 3.40. Pilot #11: personalized insurance products based on IoT connected
vehicles.


Finally, the data will be complemented with synthetic data from simulated
vehicle trips, based on the open-source SUMO tool and a custom-developed adaptor
that integrates and transforms the data according to the expected data pipeline
and data standards.

Regarding data standards, the Smart Fleet platform layer, in charge of
gathering, homogenizing, filtering, etc., is based on a FIWARE platform.
Therefore, FIWARE Data Models will be used during the project. In this respect,
the pilot is expected to contribute back to the standardization efforts fostered
by the FIWARE Foundation: some models would be adapted, or new ones would be
created.

The pilot will make use of the following datasets:


* Simulated Urban Mobility Dataset (ATOS, ~368 GB): Simulated urban mobility
data (mainly vehicles' CAN signals) through different scenarios (cities).
Captured from the SUMO tool.

* CAN Data (Historical Data) (CTAG): Data collected from vehicles' CAN bus (20
vehicles driving 4 h/day for 1 year). Historical data coming from existing
deployments.

* Traffic Events (Historical data) (CTAG, ~900 GB): Traffic events published
by the city of Vigo and DGT (historical data related to captured CAN data).

* NMEA Data for vehicles (Historical) (CTAG, ~120 GB): Complementary location
(GPS, timestamp, speed, heading...) for vehicles' CAN data (historical data
related to captured CAN data).

* CAN Signals (Live) (CTAG, ~150 GB): CAN data + driving style info
(revolutions, gear, hard braking...) + parking (close doors, windows...) +
maintenance.

* Traffic Events (Live) (CTAG, ~250 GB): Traffic events published by the city
of Vigo and DGT.

* NMEA Data for vehicles (Live) (CTAG, ~50 GB): Complementary location 
(GPS, Timestamp, speed, heading...) for Vehicles CAN Signal. 

* Motor Insurance Data (DYN, ~500 MB): Data concerning motor insurance 
including data from the policies (duration, covers), data from vehicles (licence 
No, VIN etc.) and data from drivers (age, experience etc.). 


Data Produced 


Two main business services will be produced during the pilot's implementation.
The pilot is therefore focused less on producing data than on providing
services. These services will be used internally by the insurance company. In
any case, the results produced by these services can be considered as data
produced, which can be stored in a database to feed new chains/workflows.


* Pay-As-You-Drive service:

o Input: driver's trip information.

o Output: a value from 0 to 100 describing the driver's behaviour.

* Fraud detection:

o Input: driver's trip information.

o Output: a kind of driver.


The fraud detection output would be used to compare the kind of driver against a
historical register. For example, in case of an accident, the service would
check whether the kind of driver differs from that of previous days (stolen
vehicle, identity theft).
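The two outputs described above can be illustrated with a small Python sketch.
It is deliberately rule-based: the trip fields, the penalty weights, the driver
kinds and the `min_share` threshold are all assumptions chosen for the example
and stand in for the ML models that the pilot plans to train.

```python
from collections import Counter


def driving_score(trip):
    """Toy 0-100 behaviour score; the penalty weights are illustrative assumptions."""
    per_100_km = 100.0 / max(trip["distance_km"], 1.0)
    penalty = (
        4.0 * trip["hard_brakes"] * per_100_km
        + 3.0 * trip["harsh_accelerations"] * per_100_km
        + 0.5 * trip["pct_time_over_limit"]
    )
    return max(0.0, min(100.0, 100.0 - penalty))


def differs_from_history(current_kind, history, min_share=0.2):
    """Flag a trip whose driver kind is rare in the vehicle's historical register."""
    if not history:
        return False  # nothing to compare against yet
    share = Counter(history)[current_kind] / len(history)
    return share < min_share


# Hypothetical trip summary and register of driver kinds from previous days.
trip = {"distance_km": 50.0, "hard_brakes": 2, "harsh_accelerations": 1,
        "pct_time_over_limit": 10.0}
history = ["calm", "calm", "calm", "moderate", "calm"]

print(driving_score(trip))
print(differs_from_history("aggressive", history))
print(differs_from_history("calm", history))
```

A flagged trip would only be a hint for further checks (for instance after an
accident), not proof that a different person was driving.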


Explainable Workflow 


The data collected from the vehicle is transmitted to the INFINITECH Testbed,
where it is pipelined through a workflow consisting of a set of steps. Before
reaching the Smart Fleet platform, the data is prepared for regulatory
compliance and anonymized to protect the drivers' privacy. Once the data is
ready to be managed, the Smart Fleet Platform homogenizes, filters, cleans and
standardizes it (based on FIWARE Data Models). Here the data is either prepared
as time series for real-time management or stored as historical information. On
the AI Platform, two different ML models are expected to be developed and
trained, one for each business service. Once the models have been implemented
and are available in the platform, they will be trained with data obtained from
the Smart Fleet Platform through the data gathering workflow described above. It
is important to clarify that training is not a one-off process of getting data,
training and finishing. The models will be retrained continuously according to a
specific schedule, so that they are always updated with the new data constantly
generated by the connected cars:

(1) Data management: data produced -> pre-prepared -> gathered -> streamed or
stored.
(2) The scheduled time is reached.
(3) ML model training: data feature extraction -> model training -> model
storage.

The usage of the model (the inference service, or business service) is an
independent workflow: it just deploys a service that exploits the previously
trained model. The two workflows are interconnected: the first time the services
are deployed with a model, they are linked to future training runs. When a new
training run produces a more accurate model, the inference service will update
to the resulting model automatically.
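The relationship between scheduled training and the inference service can be
sketched as follows. This is a minimal illustration under stated assumptions:
the `ModelRegistry` class, the one-feature threshold "model" and the sample data
are hypothetical stand-ins, not the EASIER-AI implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ModelRegistry:
    """Stores every trained model and exposes the most accurate one for inference."""
    versions: List[Tuple[float, Callable]] = field(default_factory=list)
    active: Optional[Tuple[float, Callable]] = None

    def submit(self, accuracy, model):
        """Store the model; promote it only if it beats the active one."""
        self.versions.append((accuracy, model))
        if self.active is None or accuracy > self.active[0]:
            self.active = (accuracy, model)
            return True
        return False


def train_threshold_model(samples):
    """Stand-in trainer: threshold halfway between the two class means."""
    positives = [x for x, label in samples if label == 1]
    negatives = [x for x, label in samples if label == 0]
    threshold = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return lambda x: int(x > threshold)


def accuracy_of(model, samples):
    return sum(model(x) == label for x, label in samples) / len(samples)


def scheduled_training_run(registry, training_data, validation_data):
    """One scheduler tick: train on the data gathered so far, then store the model."""
    model = train_threshold_model(training_data)
    return registry.submit(accuracy_of(model, validation_data), model)


# Illustrative (feature, label) pairs, not pilot data.
validation = [(1.0, 0), (2.0, 0), (6.5, 1), (8.0, 1)]
early_data = [(0.0, 0), (14.0, 1)]
later_data = early_data + [(3.0, 0), (7.0, 1)]

registry = ModelRegistry()
promoted = [
    scheduled_training_run(registry, early_data, validation),  # first model
    scheduled_training_run(registry, later_data, validation),  # more data
    scheduled_training_run(registry, early_data, validation),  # no improvement
]
print(promoted, registry.active[0])
```

The inference service would always call `registry.active[1]`, so it picks up a
better model as soon as a training run promotes one, while a run that does not
improve accuracy leaves the deployed model untouched.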


Logical Schema


An INFINITECH-RA compliant architecture of the pilot is depicted in Figure 3.41.


Figure 3.41. Personalized insurance products based on IoT connected vehicles pilot
pipeline in-line with the IRA.

The pilot's Reference Architecture and main data flows have been presented and
detailed. This RA can be simplified considering:


* A Data Management layer, which selects, captures and curates the data sources
required to implement the pilot's functionalities. At this first stage, real
connected vehicles and simulated traffic routes are the main implemented
sources, assisted by weather information and traffic incidents collected for
the area where the real vehicles will be driving.

* Data Protection and Data Processing layers, in charge of homogenising and
storing all data collected according to specific data models (provided by
FIWARE), so that it is available for the analytics processes. Also included
here are all the operations needed to anonymise/pseudonymise (as required) the
captured data and to protect this information from unauthorised access, as well
as from data uploads from untrusted sources.

* An Analytics block, fed by the data layers, where different ML/DL technologies
and visualization tools will enable data monitoring, analysis and exploitation.
Two main AI models (and inferencers) will be developed here, the Driving
Profiling and Driver Classifier tools (RA User Interaction), which will deliver
the final services: “Pay as you Drive” and “Fraud Detection” (in the RA's
Visualization layer).


The main stakeholders for this pilot are the car insurance companies and their
insured drivers, who will exploit the driving profiles and drivers'
classifications and benefit from customised prices, respectively. In Pilot #11
these stakeholders will be represented by Dynamis (DYN), which provides the user
stories.

To complete this pilot: the Automotive Technology Centre of Galicia (CTAG)
manages the enrolment of real drivers and the real connected cars, plus traffic
incidents around the driving areas; Atos (ATOS) provides the pilot's core
platform, including the traffic simulation tool and the weather datasets, and
also develops the AI models that implement the final services; and Gradiant
(GRAD) implements the Anonymization Tool and ensures that all the data managed
within Pilot #11 is GDPR compliant.


Components 


The main components to be used in the pilot include: 


* Smart Fleet Framework (Context Broker) (Data Management in RA).

* Smart Fleet Framework (PEP Proxy) (Data Security and Privacy in RA).

* Smart Fleet Framework (Historical DB: CrateDB) (Data Source in RA).

* Smart Fleet Framework (Context DB: Mongo) (Data Source in RA).

* Smart Fleet Framework (QuantumLeap) (Data Source in RA).

* Smart Fleet Framework (Weather Injector) (Data Ingestion in RA).

* Smart Fleet Framework (IoT Agent) (Internet of Things in RA).

* Smart Fleet Framework (Grafana) (Interface in RA).

* Security Framework (IDM) (Data Security and Privacy in RA).

* Anonymiser (GRAD Anonymiser) (Data Security and Privacy in RA).

* EASIER.AI (Elasticsearch) (Data Management in RA).

* EASIER.AI (Kibana) (Analytics and Machine Learning in RA).

* EASIER.AI (Logstash) (Data Ingestion in RA).

* Pay as You Drive Service (Interface in RA).

* Fraud Detection Service (Interface in RA).


Conclusions - Issues and Barriers 


Based on the work done so far, the main foreseen challenges would include: 


* Gathering enough relevant datasets from vehicles to allow the system to
define and detect a set of profiles wide enough to cover most of the driver
population.

* Identifying the proper correlations and relevant parameters from the collected
datasets that best define and differentiate the profiles and, therefore, the AI
models to infer them.

* Mapping the drivers' profiles to the context information to provide accurate
risk estimations.

* The availability of real datasets (connected cars) from insured drivers to
match the pilot's services and exploit the results.


Pilot #12: Real World Data for Novel Insurance Products 


Risk assessment is an integral part of the insurance industry, but it is usually
static, done at the beginning of a contract with a client. The aim of this pilot
is the continuous estimation of risk factors, based not just on medical history
but also on lifestyle and behaviour, as they are continuously monitored. This
allows insurance companies to offer personalized dynamic products, where
clients' premiums are calculated dynamically based on their habits.
Complementary to this, a second service will help to detect possible fraud
situations. Fraud causes unfair costs to the company, which indirectly affect
good, honest clients.

In both use cases the underlying technology is based on analysing the Real-World
Data (RWD) of the clients. The business analysis part, which determines how
healthy a client's lifestyle is, and the detection of possible fraud will be
based on ML techniques. Because personal data is managed in the pilot, security
and privacy will also play an important role (Figure 3.42).

Pilot #12 focuses on health insurance and risk analysis by developing two
AI-powered services: Risk Assessment, which allows the insurance company to
adapt prices by classifying clients according to their lifestyle; and Fraud
Detection, which helps to identify fraudulent behaviour of clients when using
the activity trackers and answering the questionnaires. These two services rely
on a people modelling approach that requires data from actual and simulated
persons for training. An overview of Pilot #12 is given in Figure 3.42.

Figure 3.42. Pilot #12 real world data for novel health insurance products overview.

Current health insurance services are based on medical history and very static
information. The innovation of Pilot #12 lies in applying new technologies (IoT
and AI) to provide more dynamic and customized services.


Technological components and Services 


The High-Level Architecture presents the software components that build the
pilot's use cases (Figure 3.43). This book series links the software components
shown with the corresponding RA layers, providing some details about their
implementation. In this sense:


* IoT infrastructures supply raw datasets required for the implementation of
the final services. The pilot has already identified and linked the Healthentia
platform for Real-World Data collection and the RWD Simulator, both from
iSprint, as the data sources. Client records maintained by the insurance
companies are still under investigation.

* Data Collection & Aggregation and Data Normalization components (Data
Management and Protection layers of the RA): the Healthentia platform already
handles ingestion from IoT infrastructures. Worth noting here is the work done
to implement and integrate the Real World Data (RWD) Simulator. Also, Gradiant's
Anonymizer tool analyses and anonymises (when required) the collected data
before it is uploaded.

* The ML component (RA Analytics layer) uses Scikit-Learn and Keras/TensorFlow
to build the subject profiling and subject classification inferencers (User
Interaction RA layer) that will support the two services (Visualization RA
layer).

Figure 3.43. Pilot #12 hardware testbed.


Testbed 


Pilot #12's deployment relies on the UNINOVA infrastructure. Further details of
the first software/hardware analysis and its results are summarised in
Figure 3.44.

The UNINOVA infrastructure is currently being dimensioned to support several
clusters, so it is not yet available for deployments. The Pilot #12 demonstrator
is currently being designed following a Docker and Kubernetes approach, to
facilitate a possible initial deployment on a temporary server and the final
deployment at the UNINOVA testbed.


Other non-technical requirements 


The success of Pilot #12 depends on the wealth of data made available for
training and inference. Data is obtained from users who employ measurement
devices and are willing to participate in the pilot by using the measurement
infrastructure, reporting symptoms, liquids and meals, and answering the
questionnaires. The RWD Simulator is being built to supply the necessary data
volume, but its models depend on the observations made on the actual data being
collected.

Figure 3.44. Pilot #12: real world data for novel insurance products.


Implementation of a first Proof of Concept 


The primary focus of the Proof of Concept demonstrator of Pilot #12 is data col- 
lection: what to measure, what to ask for, how to collect and how to simulate. 
Secondary points of focus are the pilot’s testbed and the risk analysis service. 

Pilot #12 data collection is based on the Healthentia platform, by Innovation
Sprint. Healthentia is an eClinical system that comprises mobile apps at the
data source (the pilot participants) and a platform for collecting the data. A
portal app allows data visualization. The pilot's first goal has been to
repurpose Healthentia from the clinical to the health insurance domain. To this
end, the data collected and the questionnaires forwarded to the pilot
participants have been selected and defined. Currently we collect physiological
data from four possible sources: a Garmin connector, a Fitbit connector, an
Apple HealthKit connector and a proprietary Android sensing service. Our
questionnaires span symptoms, liquid and food consumption, and the
self-assessment of quality of life and health through the EQ-5D-5L
questionnaire.

Data are also being provided by the RWD Simulator built for INFINITECH. The
simulator accepts people's personality traits and health profiles and simulates
their activities and the corresponding measurements and questionnaire answers.
The simulated data have exactly the same structure as the actual data and are
also collected by Healthentia.

Regarding the pilot's testbed, a temporary setup is being managed by Innovation
Sprint using a Linux 2020LTS server at Hetzner. It is a VM with 2 vCPUs, 8 GB
RAM and 80 GB storage (CX31 instance). There, Ubitech's Data Capturing Tool has
been configured to capture data from the Healthentia API and store it in the
LeanXcale DB.

Finally, regarding the risk analysis service, classifiers have been built using
the simulated data to predict whether the health of a person is expected to
improve or not during a week, based on the week's measurements and reports. Both
Random Forest and fully connected Neural Network classifiers have been trained,
with the NN performing slightly better, achieving 78% correct identification of
the health trend.


Expected Outcomes 


The pilot will produce a personalized life insurance system, with interfaces for both 
citizens and insurance companies. 


* Dynamic individualized adaptation of coverage and pricing according to the
client's behaviour and automated risk calculation.

* Fraud Detection as a mechanism that analyses the client's behaviour with the
aid of historical data.

* Automated data privacy risk assessment and mitigation. 


The figure below illustrates the Pilot #12 workflow in detail. It comprises
north-, south-, east- and west-bound APIs and the different layers applied in
the pilot.


Datasets 


The main data source in the pilot is the RWD collected by Healthentia. Healthentia 
is a platform for measuring and reporting RWD. Measurements are based on sen- 
sors on smartphones or IoT wearable devices. Reports employ questionnaires that 
the clients periodically answer utilising the Healthentia app. A secondary source of 
data is the records of the clients of the health insurance companies. Finally, the data 
will be complemented with synthetic/simulated data. 


* Healthentia Live (average 720 kB per user per week): Measured physical
activity (steps, floors, sleep and heart rate) and user-reported data from users
of Healthentia SaaS who have given consent.

* Healthentia Simulated (average 720 kB per user per week): Simulated physical
activity and reported data.

* Activity tracking datasets based on 100s of individuals/users that will be
engaged in the pilot by RRD.

* 100s of citizens' feedback datasets.

* 1000s of nutritional information datasets.

* Simulated activity datasets from 1000s of patients, based on the simulation
module of the Healthentia platform.


Data Produced 


Two main business services will be produced during the pilot's implementation.
The pilot is therefore not focused on producing data, but on service provision.
These services will be used by the insurance company. To provide these services,
the ML module will produce models, which can be considered as data produced and
can be stored in a database to feed new chains/workflows.


* Risk assessment service:

o Input: the client's lifestyle, expressed as long-term and short-term averages
and trends of physiological parameters related to activity, sleep, the heart,
nutrition, hydration, body signals (blood pressure, temperature), weight and
symptoms (pain, fatigue, diarrhea, nausea, cough).

o Output: decisions on health outlook, accumulated across time to form a health
assessment ranging from -100 to +100.

* Fraud detection:

o Input: the client's lifestyle, expressed as above, and the models of all
clients.

o Output: probability of fraud, quantifying the mismatch of the client's current
behaviour from their own past behaviour and from that of other clients.
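The two outputs above can be illustrated with a short Python sketch. It assumes
a weekly decision of +1 (improving) or -1 (worsening), a fixed accumulation
step, and a simple z-score mismatch against the client's own history; these
choices, the function names and the sample values are assumptions for
illustration, not the pilot's trained models, and the comparison against other
clients is left out for brevity.

```python
from statistics import mean, pstdev


def accumulate_health_assessment(weekly_decisions, step=10):
    """Accumulate weekly outlook decisions (+1 improving, -1 worsening)
    into a bounded assessment in [-100, +100]. The step size is an assumption."""
    assessment = 0
    for decision in weekly_decisions:
        assessment = max(-100, min(100, assessment + step * decision))
    return assessment


def behaviour_mismatch(current, past):
    """Average absolute z-score of the current week against the client's own past."""
    z_scores = []
    for name, value in current.items():
        history = past[name]
        spread = pstdev(history) or 1.0  # avoid division by zero for flat histories
        z_scores.append(abs(value - mean(history)) / spread)
    return mean(z_scores)


def fraud_probability(mismatch, scale=3.0):
    """Squash the mismatch into [0, 1); a heuristic stand-in, not a calibrated model."""
    return mismatch / (mismatch + scale)


# Illustrative values, not pilot data.
past = {"steps": [8000, 8200, 7800, 8000], "sleep_h": [7.0, 7.5, 6.5, 7.0]}
typical_week = {"steps": 8100, "sleep_h": 7.2}
suspicious_week = {"steps": 25000, "sleep_h": 7.0}

print(accumulate_health_assessment([+1, +1, -1, +1]))
print(fraud_probability(behaviour_mismatch(typical_week, past)))
print(fraud_probability(behaviour_mismatch(suspicious_week, past)))
```

In this sketch a week that departs sharply from the client's usual activity
yields a value close to 1, while a typical week stays low.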


Explainable Workflow 


The RWD collected from the client using Healthentia and the secondary sources
are transmitted to the INFINITECH Testbed, where they are aggregated, anonymised
for protection and stored. Stored data are used to (re)train the risk and fraud
assessment models. The trained models are used by the services on input data
without anonymisation to provide the risk and fraud assessments. The outputs of
the services are offered to the health insurance professionals via the
presentation layer of the pilot, together with all collected RWD for human
insight/verification. The presentation layer is the Healthentia portal app.
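As a simple illustration of the protection step before storage, the sketch
below replaces a client identifier with a keyed hash and drops direct
identifiers. Note that this is pseudonymisation rather than full anonymisation,
and it is not the Anonymization Tool used in the pilot; the field names, the key
and the helper `pseudonymise` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be kept outside the analytics environment.
PSEUDONYM_KEY = b"replace-with-a-secret-key"
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def pseudonymise(record, key=PSEUDONYM_KEY):
    """Replace the client id with a keyed hash and drop direct identifiers."""
    token = hmac.new(key, record["client_id"].encode("utf-8"), hashlib.sha256).hexdigest()
    cleaned = {k: v for k, v in record.items()
               if k not in DIRECT_IDENTIFIERS and k != "client_id"}
    cleaned["pseudonym"] = token[:16]
    return cleaned


# Illustrative records, not pilot data.
week_1 = {"client_id": "C-1001", "name": "Jane Doe", "steps": 8100, "sleep_h": 7.2}
week_2 = {"client_id": "C-1001", "name": "Jane Doe", "steps": 7900, "sleep_h": 6.8}

stored_1 = pseudonymise(week_1)
stored_2 = pseudonymise(week_2)
print(stored_1)
```

Because the same client always maps to the same pseudonym, records from
different weeks can still be linked for (re)training without storing the
original identifier.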


Logical Schema 


An initial mapping of the pilot architecture to the INFINITECH-RA is depicted 
in the following Figure 3.45. 


Figure 3.45. Real world data for novel health-insurance products pilot pipeline in-line
with the IRA.

Figure 3.46. Pilot #12 high level architecture.


The pilot's Reference Architecture and main data flows have been presented
(Figure 3.46). This RA can be simplified considering:


* A Data Management layer, which selects, captures and curates the data from
the actual and the simulated people.

* Data Protection and Data Processing layers, in charge of homogenising and
storing all collected data, so that it is available for the analytics processes.
Also included here are all the operations needed to anonymise/pseudonymise (as
required) the captured data and to protect this information from unauthorised
access.

* An Analytics layer, fed by the data layers, where different ML/DL
technologies and visualization tools will enable data monitoring, analysis and
exploitation. Two main AI models (and inferencers) will be developed here, the
subject profiling and subject classifier tools, which will deliver the final
services in the RA's Visualization layer.


The main stakeholders for this pilot are the health insurance companies and
their insured clients, who will exploit the subjects' profiles and
classifications and benefit from customised prices, respectively. In Pilot #12
these stakeholders will be represented by Dynamis (DYN), which provides the user
stories. To complete this pilot: Roessingh Research and Development (RRD)
manages the enrolment of real subjects; Innovation Sprint (iSprint) provides the
data collection platform and the subject simulator; Singular Logic (SiLo) and
Innovation Sprint (iSprint) develop the AI models that implement the final
services; and Gradiant (GRAD) implements the Anonymization Tool and ensures that
all the data managed within Pilot #12 is GDPR compliant.


Components 


The following components will be deployed and used as part of the pilot: 


* UBITECH Data Capturing Tool (Data Ingestion in RA). 

* LeanXcale Database (Data Management in RA). 

* Innovation Sprint's ML services (risk assessment and fraud detection)
(Analytics and Machine Learning in RA).

* ATOS Regulatory tool through Data Protection Orchestrator (DPO) (Data
Security and Privacy in RA).

* GRAD Regulatory tool through Anonymization Component (Data Security 
and Privacy in RA). 


Conclusions - Issues and Barriers 


The PoC of Pilot #12 allowed us to implement the data collection system,
addressing both the what and the how. The low engagement of the PoC participants
(about 50%) is alarming and will be addressed in the Data Sharing Acceptance and
Usability Study from the data, privacy and UI/UX perspectives. It is also being
addressed technically by increasing the measurement options and optimising the
Android sensing service. Our aim is to soon be gathering enough relevant data
from diverse users to support both the risk assessment and fraud detection
services.

The testbed will be transferred to its permanent location at the NOVA server,
but the PoC has already set in motion all the collaborations necessary for its
setup amongst the INFINITECH partners that are not members of the pilot.

The risk assessment service has been addressed in the PoC via an initial
predictor of weekly variations of health. Both classifiers and regressors will
be built in the coming months, and the feature vector used to train them will be
optimised; as a result, the heart of the service will be in place. Fraud
detection has not been addressed yet, and this is a concern, since today people
cheat on their activity trackers just to get a badge in their favourite wellness
app. This could escalate when health insurance discounts are involved.


3.4 Predictive Financial Crime and Fraud Detection 
Pilots 


Pilot #7: Avoiding Financial Crime


The aim of this pilot is to see whether Financial Crime can be detected more
accurately and sooner than with any existing system by using AI and advanced
computational power. The goal of Operation Whitetail is to explore how next
generation technical solutions like Machine Learning and AI could help to create a
more accurate, comprehensive and near real-time picture of suspicious behaviour in
the Financial Crime remit (Anti-Money Laundering and Combating Terrorist
Financing). The pilot explores this for financial crime and fraud in the use case
of instant loans. Such loans can be requested online and are subject to fraud and
crime, e.g. identity theft. Based on comprehensive data, including KYC and
transaction data, a financial crime risk score is calculated by AI/ML algorithms.
The instant loan can then be approved or denied on the basis of this score.
Within the pilot the following processes are addressed: 


* KYC (Know Your Customer), for screening the vast amount of available data
sources in near-real time, to ensure that KYC data is automatically updated
to the most recent information available on the customer, thereby improving
data quality.

* Customer risk profiling, based on feeding in the transaction-based customer
behavioural profile data and the KYC results, leading to an advanced risk score
that could provide a holistic customer risk profile and will enable the business
to respond more quickly to newly identified risks and changes in criminal
behaviour.


The pilot therefore plans to use synthetic or anonymized data as its source.
Bank-internal and bank-external sources of KYC data shall be joined in an advanced
KYC data store.

The external data sources include public sources or sources actively shared by
the customer. Information from external sources will be obtained through
traditional channels, e.g. credit reference agencies and sanctions lists.

The advanced KYC data are used to extract a customer profile. Additionally,
customer transaction patterns are extracted from the customers' transaction data.
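The scoring step described above can be illustrated with a minimal sketch. It is not the pilot's model: the feature names, weights, bias and denial threshold are hypothetical placeholders, and a simple logistic function stands in for whatever AI/ML algorithm the pilot eventually uses.

```python
import math
from dataclasses import dataclass


@dataclass
class CustomerFeatures:
    """Hypothetical features; the pilot's real attributes are not specified."""
    on_sanctions_list: bool           # from external KYC sources
    identity_verified: bool           # from bank-internal KYC data
    avg_monthly_tx_count: float       # from the behavioural (transaction) profile
    share_high_risk_countries: float  # fraction of transfers, 0.0 to 1.0


# Illustrative weights only; a trained model would supply these.
WEIGHTS = {
    "on_sanctions_list": 4.0,
    "identity_unverified": 2.0,
    "share_high_risk_countries": 3.0,
    "low_activity": 0.5,
}
BIAS = -3.0
DENY_THRESHOLD = 0.5  # assumed cut-off, to be set by financial crime experts


def risk_score(c: CustomerFeatures) -> float:
    """Combine KYC and transaction features into a score between 0 and 1."""
    z = BIAS
    z += WEIGHTS["on_sanctions_list"] * c.on_sanctions_list
    z += WEIGHTS["identity_unverified"] * (not c.identity_verified)
    z += WEIGHTS["share_high_risk_countries"] * c.share_high_risk_countries
    z += WEIGHTS["low_activity"] * (c.avg_monthly_tx_count < 1.0)
    return 1.0 / (1.0 + math.exp(-z))


def decide_instant_loan(c: CustomerFeatures):
    """Return the decision together with the score that justifies it."""
    score = risk_score(c)
    decision = "deny" if score >= DENY_THRESHOLD else "approve"
    return decision, score
```

Returning the score alongside the decision keeps the outcome traceable, which matters when an analyst has to review why a loan request was denied.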


Expected Outcomes 


* Better detection of suspicious customers and transactions (transaction mon- 
itoring). 

* Fewer false positives. 

* Near-real time update of the customer's extended KYC profile.

* Near-real time update of the customer's behavioural profile. 

* New, holistic risk score based on a complex risk model. 

* More effective and efficient analysis of financial crime alerts/anomalies.


Figure 3.47 illustrates the design architecture of the Pilot #7 workflow.


Datasets 


* Transactions and Customer attributes (anonymized).

* The pilot will use synthetic or anonymized data as source. Sources in the
bank-internal data pool will be accessed. This data pool also includes
bank-internal and external KYC data and internal transactional data. These data
shall be joined in an advanced KYC data source, and the relevant data for the use
case will be extracted from it. Due to compliance rules, these data need to
be treated as confidential. In a first step, use-case-related data representing
customer profiles will be extracted, facilitating the development of synthesized
data sets that give insight into the financial crime risk score and facilitating the
development of AI/ML models.


Data Produced 


The pilot will produce data giving insight into the financial crime (i.e. instant
loan) risk score. This may include a risk score, customer data, transaction patterns
and details. The detailed data to be presented are yet to be specified and will
depend on the advice of the Financial Crime experts in the bank.


Explainable Workflow 


Within the pilot the following processes are addressed. KYC (Know Your Customer)
screens the available data sources in near-real time to ensure that KYC data is
automatically updated to the most recent information available on the customer,
thereby improving data quality. Customer risk profiling feeds in the
transaction-based customer behavioural profile data and the KYC results, leading
to an advanced risk score that could provide a holistic customer risk profile and
will enable the business to respond more quickly to newly identified risks and
changes in criminal behaviour.

The workflow will produce data giving insight into the financial crime risk score.
This may include a risk score, customer data, transaction patterns and details. The
detailed data to be produced are yet to be specified and will depend on the
advice of the Financial Crime experts in the bank (Figure 3.48).


Logical Schema 


The figure below summarizes the financial crime pilot pipeline in detail. The
components involved are described in the following section.


Components 


Due to strict compliance and approval procedures in CXB, the pilot operations
are facilitated by splitting the tasks into a pre-processing part and a pilot
processing part. The pre-processing part may be mimicked by INFINITECH tools
based on previously synthesized/anonymized data. However, for smooth progress of
the pilot development, a bank-internal process and an INFINITECH process will be
considered as a first step.

A list of the main components to be deployed and used in the pilot follows.

Pre-processing (inside the bank, using bank-approved tools):


* Bank Data Pool (Data Sources in the RA). 

* Bank Data Pool Extraction (Data Management in the RA). 
* Bank Data Pool Join (Data Management in the RA). 

* Data synthesis/anonymization (Data Source in the RA).


Pilot processing (the synthesized/anonymized data are then used in the
INFINITECH pilot, Figure 3.49):


* Synthesized/anonymized data (Data Source in the RA). 
* Data Ingestion (Ingestion in the RA). 

* Data Analytics/Scoring (Analytics in the RA). 

* Visualization (Presentation in the RA). 


Pilot #8: Platform for Anti Money Laundering Supervision 
(PAMLS) 


The objective of the pilot is to develop a Platform for Anti-Money Laundering
Supervision (PAMLS), which will improve the effectiveness of the existing
supervisory activities in the area of anti-money laundering and combating terrorist
financing (AML/CTF) by processing large quantities of data (Big Data) owned by the
Bank of Slovenia (BOS) and other competent authorities (FIU).


Figure 3.50. Pilot #8: components of anti money laundering supervision (PAMLS).


The pilot will develop a platform named PAMLS that will improve the effectiveness
of the existing supervisory activities in the area of ML/TF (Figure 3.50) through:


* Automated and transparent data gathering that will include data quality
control.

* Improved analysis of big data coming from a wide range of different sources
(e.g. payment transactions, data acquired from the FI, business register etc.).

* Improved Risk Assessment (as an ongoing and cyclical process) with automated
feeds from big data analysis.

* A more cost-efficient risk assessment process due to less time-consuming data
gathering tasks, assessments of the FI and the financial sector and
semi-automated features.

* A more effort-efficient risk assessment process, so that additional resources
can be focused on the supervision of identified high risks.


PAMLS will consist of four main business services:


* Risk assessment tool: to assess the money laundering and terrorist financing
(ML/FT) risks of financial institutions (FIs) and the risk of a whole sector, to
support risk-based supervision,

* Screening tool: for screening payment transactions, enriched with data from
the business register (ePRS) and the transactions accounts register (eRTR), to
recognize unusual patterns that could indicate typologies and risks of ML/FT at
the level of an individual FI or the whole sector,

* Search engine: allowing the supervisor to look for a specific transaction or a
sample of transactions,

* Distribution channel: for securely gathering the data that will feed the risk
assessment tool and the screening tool.


* Risk Assessment tool: which provides risk assessment functionalities within
PAMLS.


Technological components and Services 


The following is a list of components, grouped by the Reference Architecture layers
for Data Analytics and User Interfaces.
Data Analytics Layer main components: 


* Risk Calculation engine and Complex search services, which will be implemented
specifically for the Pilot #8 requirements and will therefore be tailored to
BOS specifics:


o Current status: 1st version developed on scrambled data
o Next version: M27


* Anomaly detection & prediction analysis, which will provide functionali- 
ties for anomaly detection and prediction for time series data including Pat- 
tern analysis, which will provide analytical services on data graphs, including 
detection of complex patterns on data graphs: 


o Current status: to be developed 
o First version: M27 


* Stream story is a component for the analysis of multivariate time series. It
computes and visualizes a hierarchical Markov chain model which captures
the qualitative behaviour of the system's dynamics, where the system is described
by a group of time series.


o Current status: to be developed 
o First version: M27 


User Interfaces Layer main components: 


* Risk Assessment tool: 1st version already developed, next version at M27.


Expected Outcomes 


* Automated, more accurate and more dynamic detection of money laundering 
transactions. 

* Scalable, multimodal data platform, compliant with the legal and regulatory
framework. KPIs associated with the volume of data processed, the speed of
processing and the effort required for processing will be tracked.


Datasets


The relevant datasets planned to be analyzed within PAMLS are:


* TARGET2 transactions (confidential data):


o Transactions executed by the Slovenian payment institutions within
TARGET2 (Trans-European Automated Real-time Gross Settlement
Express Transfer System)

o High value (above 50.000 EUR), urgent transactions in EUR

o Transactions processed through BOS payment systems (responsible: BOS
Payment Settlement and Systems department, PPS)


* SEPA transactions (confidential data):


o Transactions executed by the Slovenian payment institutions within SEPA
(Single Euro Payments Area)

o Domestic and international transactions within the SEPA area in EUR, under
50.000 EUR in value

o Transactions processed through payment systems by a third party provider


* FIU transactions (public data):

o Transactions related to high risk countries above 15.000 EUR, reported to
the Slovene Financial Intelligence Unit (FIU)


* FI identification data (confidential data):

o Identification information about the Financial Institution (FI)
o Aggregated statistical data on the FI inherent risk and control environment
(number of clients, number of suspicious transaction reports (STR) etc.)
o FI reports to the BOS (reports are confidential)


* ePRS data (public data):

o Slovenian Business Register (public data on legal entities)


* eRTR data (public data):

o Slovenian Transactions Accounts Register (public data on legal entities)


* High risk country list (public data):

o List of countries defined as high risk due to a lacking or ineffective
AML/CTF system
o List is managed and published by the Slovene FIU (public data)


Personal data will be anonymized by the source prior to data delivery
to PAMLS.


Data Produced


The pilot produces an ongoing risk assessment for the purpose of Anti-Money
Laundering and Combating Terrorist Financing supervision of the FIs and the FI
sector.


Explainable Workflow 


PAMLS will use various data sources, which can be divided into three groups.

The first group consists of transactional data processed through payment service
providers in Slovenia (TARGET2 and SEPA transactions). This group will first be
enriched with ePRS and eRTR data and then pseudo-anonymized (so that it is
anonymized from the end user's point of view). Before the data are stored in the
PAMLS internal data storage, they will also be joined with the High risk country
list.

The second group of data sources represents public data (FIU transactions), which
will also be enriched with ePRS and eRTR data, then joined with the High risk
country list and stored in the PAMLS internal data storage.

The third group of data sources represents FI data (data on FI inherent risk and
control environment), which will be stored in the PAMLS internal data storage after
a positive Data Quality Check.
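The preparation of the first group (enrichment, pseudo-anonymization and the join with the High risk country list) can be sketched as follows. This is only an illustration under stated assumptions: the record fields (`payer_id`, `payee_country`, `legal_form` and so on), the keyed-hash pseudonymization and the key handling are hypothetical and are not taken from the PAMLS implementation.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be held by the data source, not by PAMLS.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash so it stays linkable but unreadable."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def prepare_transaction(tx: dict, business_register: dict, high_risk_countries: set) -> dict:
    """Enrich one transaction, pseudonymize the parties and flag high risk countries."""
    # Enrichment with (assumed) business register attributes of the payer.
    register_entry = business_register.get(tx["payer_id"], {})
    return {
        "payer": pseudonymize(tx["payer_id"]),
        "payee": pseudonymize(tx["payee_id"]),
        "amount_eur": tx["amount_eur"],
        "payer_legal_form": register_entry.get("legal_form", "unknown"),
        # Join with the High risk country list.
        "high_risk_country": tx["payee_country"] in high_risk_countries,
    }
```

Enrichment happens before pseudonymization in this sketch because the register lookup needs the original identifier, which matches the order of steps described above.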

After data is ingested into the PAMLS platform, it needs to be preprocessed so
that the information is properly enriched and provided in a suitable data
format (vectors, graphs). A process of feature engineering, tailored to specific
goals, will follow. PAMLS will develop and test novel approaches for detecting
unusual patterns of ML/TF, which could be labelled as high risk later in the process
and will have an effect on the final FI risk assessment. Part of PAMLS is also the
Risk Calculation engine. There, the risk will be continuously calculated at the
level of a sector or a particular FI, using a predefined Risk Assessment
methodology. To empower bank analysts and to develop and test novel approaches,
PAMLS provides three components: Stream story, Pattern discovery & matching, and
Anomaly detection & prediction. With the introduction of enriched graph topologies
and hierarchical Markov chain models, PAMLS will capture the qualitative behaviour
of the system's dynamics and enable analysts to discover new regularities and
correlations on a larger scale. These components will enable iterative development
of potential upgrades to the existing Risk Assessment methodology and the discovery
of novel or additional money laundering and terrorist financing typologies. PAMLS
will also provide three different user interfaces, which correspond to three
different use cases.
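The internals of Stream story are not described here, so the following sketch shows only the basic building block behind a Markov chain model: estimating first-order transition probabilities from a time series that has already been discretized into states. The state labels and the function name are hypothetical, and the hierarchical aggregation of states is not covered.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate P(next state | current state) from an ordered state sequence."""
    counts = defaultdict(Counter)
    # Count every observed (current, next) pair of consecutive states.
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    # Normalise each row of counts into probabilities.
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }


# Example: weekly transaction volume of one FI, discretized into two states.
weekly_states = ["low", "low", "high", "low", "high", "high"]
probabilities = transition_probabilities(weekly_states)
```

A transition that is rare in the estimated model but appears in new data is one simple signal an analyst could inspect further.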


Logical Schema 


The following figure illustrates an initial logical mapping of the pilot components
to the layers and pipelines approach of the INFINITECH-RA (Figure 3.51).


Figure 3.51. PAMLS pilot pipeline in line with the IRA.


The components implemented for the first PoC include:


* Data sources:


o Synthetic FI data (data about inherent risk and their control environment)
o Implemented APIs
o Implemented DQ, 1st version


* Risk methodology: framework

* Risk engine:
o Defined technical requirements (flexibility, scalability, DQ, etc.)
o 1st version implemented
o Validation & verification of risk calculations

Additionally, the Stream Story component was used to search the data for
meaningful patterns. However, since it was applied to scrambled data, meaningful
verification of data patterns was not possible.


Testbed 


Pilot #8 will be hosted at the testbed on the premises of the BOS. The testbed is
ready, and the software components and data needed to implement the PoC have
already been deployed on it. The testbed has been specified in the following manner:


Hardware        Description

HP Z4 G4 WKS    CPU: Intel Xeon W-2125 4.0 4C
                RAM: 256GB (8x32GB) DDR4
                Graphic: NVIDIA Quadro P400 2GB (3)mDP Graphics
                Disk: Z Turbo Drv 1TB PCIe NVMe OPAL2 TLC SSD


The testbed is based on the Windows operating system and includes the following
software:


* Libraries (QMiner & SNAP).

* External tools (candidates): (PostgreSQL/Elasticsearch).

* Programming tools (C/C++ compilers (GNU or Microsoft), Python,
Node.js).


Other non-technical requirements


Due to standard security measures at BOS, the physical presence of the JSI
development team at the BOS premises is required in order to use the Pilot #8
testbed. As a consequence of the strict measures to mitigate the spread of Covid-19
and of additional security measures, external parties (including JSI as a partner
in Pilot #8) are not granted access to the BOS premises (Figure 3.52). During the
PoC implementation, development was done on scrambled data in order to preserve
data privacy. The PoC implementation was therefore done at the JSI site, while
validation and initial testing were done by BOS in several phases, in which the risk
calculations were validated.


Figure 3.52. Risk assessment tool data flow.


In the next phase, the PoC will be transferred to the Pilot #8 testbed at the BOS
site. For the next development phases of Pilot #8, it is crucial that appropriate
test data is available during the initial development of the AI components, while
proper validation and testing needs to be done on real data at the BOS site.


Implementation of a first Proof of Concept 


In accordance with the development timeline, a first prototype of the Risk
assessment tool was developed.

The Risk Assessment Tool PoC was implemented in quick agile cycles. It provides
the following functionalities:


1. Sector Risk Assessment view allows us to review the risk of all financial
institutions (FIs) based on the assessment of their inherent risk and control
environment in a specific year. Based on their risk, FIs are placed in the
Risk Assessment Matrix as low, medium, medium high or high risk. The
supervisory authority will focus on those presenting higher risk. Since the
final risk assessment is an evaluation of inherent risk and control environment,
the view also provides a graphical schema of those two important elements of
the risk assessment. Changes in the risk of a specific FI are also an important
factor. Therefore, the Sector Risk Assessment view also provides a historical
view for selected FIs.

2. Inherent risk/Control environment view: Inherent risk (and similarly the
control environment) consists of different risk areas, and those consist of
different elements. In this view the supervisor can drill down to the specific
elements and compare FIs with each other. The supervisor also receives
information on which areas or elements of the inherent risk or of the control
environment present more risk for a specific FI and can therefore focus on
those areas during the on-site supervision.

3. Bank Profile view enables the supervisor to select an FI for a detailed review.
In the first version of the PoC the view consists of FI basic information (FI ID
Card), a graph of the FI's risk assessment changes through the year (historical
view) and detailed information on FI inherent risk and control environment
(Figure 3.55).


Figure 3.53. Sector risk assessment view.

Figure 3.54. Inherent risk and control environment view.

Figure 3.55. Bank profile view.
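The placement of FIs in the Risk Assessment Matrix can be sketched as below. The actual BOS Risk Assessment methodology is not given in this text, so the 1 to 4 scales, the additive combination and the band limits are all hypothetical; only the four category names come from the description above.

```python
CATEGORIES = ["low", "medium", "medium high", "high"]


def risk_category(inherent_risk: int, control_weakness: int) -> str:
    """Map two assumed 1-4 scores (4 = riskiest / weakest controls) to a category."""
    for value in (inherent_risk, control_weakness):
        if not 1 <= value <= 4:
            raise ValueError("scores are expected on a 1-4 scale")
    combined = inherent_risk + control_weakness  # ranges from 2 to 8
    if combined <= 3:
        return "low"
    if combined <= 5:
        return "medium"
    if combined <= 6:
        return "medium high"
    return "high"


def place_in_matrix(fis: dict) -> dict:
    """Group FIs by category; `fis` maps an FI name to its two scores."""
    matrix = {category: [] for category in CATEGORIES}
    for name, (inherent_risk, control_weakness) in fis.items():
        matrix[risk_category(inherent_risk, control_weakness)].append(name)
    return matrix
```

Keeping the two input scores separate until the last step mirrors the tool's drill-down views, where inherent risk and control environment are inspected individually.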


Components 


The pilot will use the following components: 


* Risk Calculation engine and Complex search services (Analytics in the RA)

* Anomaly detection and prediction component (Analytics in the RA): will
provide functionalities for anomaly detection and prediction for time series
data including Pattern analysis. The latter will provide analytical services on
data graphs, including detection of complex patterns on data graphs;

* StreamStory component (Analytics in the RA): a component for the analysis
of multivariate time series. It computes and visualizes a hierarchical Markov
chain model which captures the qualitative behaviour of the system's dynamics,
where the system is described by a group of time series;

* Pattern discovery and matching component (Analytics in the RA)

* Pseudo-anonymization tool (Data Management in the RA)

* PostgreSQL (Data Management in the RA)

* Elasticsearch (Data Management in the RA)

* Neo4j (Presentation in the RA)


Conclusions - Issues and Barriers 


Development in Pilot #8 is going according to plan. As a PoC we provided the first
version of one of the use cases, the Risk Assessment tool. It will facilitate
supervision activities by providing relevant data analysis on the fly. It will
enable risk analysts to gather and analyse risk assessment data more efficiently and
provides a straightforward analysis of the risk methodology on the one hand and an
analysis of Slovenian FIs in terms of inherent and control risks through the years
on the other. It enables comparison of particular risk categories and provides
detailed insights.

The regulatory requirements that apply to Pilot #8 require additional actions that
were not foreseen at the start of the project (approval by compliance and the
management board, anonymization requirements etc.). Although such additional
actions could affect the pilot development timeline, they do not change the planned
development or the defined set of use cases.


Pilot #9: Analyzing Blockchain Transaction Graphs for 
Fraudulent Activities 


There can be blockchain crypto currencies and tokenized assets (e.g. USD, EUR, 
TRY tokens) that are obtained fraudulently as a result of ransomware and theft of 
funds. These fraudulent assets can go through various transfers on the blockchain 
and enter the regulated environments in different jurisdictions. As a result, it is 
possible that a company may accept deposits of crypto currencies and tokens that 
can be traced to addresses involved in fraudulent activities. 

Pilot #9 is developing a parallel and scalable transaction graph analysis system
that can construct and operate on the massive Bitcoin and Ethereum blockchain
transaction graph with distributed dynamic data structures on an HPC cluster.
During Period 1 of the project, the pilot implemented fraudulent activity analysis
based on parallel graph algorithms. In Period 2 of the project, it also started
implementing machine learning based analysis algorithms. The pilot also provides
a user interface that offers various queries and visualizes the results using a
graph drawing package.

Pilot #9 aims to detect fraudulent activities by monitoring blockchain transactions.
Blockchain crypto currencies and tokenized assets that are obtained fraudulently
can go through various transfers on the blockchain and end up as stable coins
(e.g. USD, EUR, TRY tokens) in different jurisdictions. As a result, it is possible
that a company that accepts crypto-currencies or stable coins is paid with stable
coins that can be traced to addresses involved in fraudulent activities. Holding
crypto-currencies or stable coins that originated from fraudulent or sanctioned
addresses can be risky for the company. Hence, construction of the massive
blockchain transaction graph and its analysis is necessary to trace and detect
fraudulent addresses. Since blockchain data is constantly accumulating and will
grow at increasing rates in the future, a parallel, scalable transaction graph
analysis system is being developed that runs on an HPC cluster and can process the
growing transaction graph without encountering performance bottlenecks.

The main innovation of the pilot lies in the application of HPC technologies
to the analysis of huge blockchain transaction graphs, in order to quickly detect
possible fraud based on blacklists.

The main components of the pilot and the partners in charge of their
development are as follows:


* Blockchain Transaction Dataset Preparation Component (developed by 
BOUN (Bogazici University)), 

* Scalable Transaction Graph Analysis Component (developed by BOUN
(Bogazici University)),

* User Interface for Blockchain Transaction Reports and Visualization
Component (developed by AKTIF Bank).


The final users will be banks that need to analyse blockchain addresses.
The developed services can also be offered to companies that need to perform
such checks, for example, companies that accept token payments.

The first year was aimed at preparing the massive blockchain dataset, constructing
an HPC cluster-based parallel transaction graph analysis system and coding
traversal-based graph algorithms. The second year will concentrate on machine
learning based approaches for analysis using, in particular, the graph system
developed in the first year for feature extraction.


Testbed 


Figure 3.56 depicts the testbed, which is currently set up and running on the
Amazon cloud. The following hardware and software configuration is used
for the testbed:


Hardware: 


* HPC Cluster on Amazon Cloud (16 c5.4xlarge instances), each instance 
having 16 virtual CPUs, 32 GiB memory and 500 GB SSD storage. 


* A medium Amazon instance for running the message queue.


Software:


* Ubuntu Linux operating system

* StarCluster HPC cluster toolkit

* MPI message passing interface

* RabbitMQ message queue

* Metis parallel graph partitioner

* Vis.js open source graph visualization software for the web interface.


Implementation of a first Proof of Concept 


The figure below shows the architecture of the Proof of Concept system that has 
been implemented. 


Expected Outcomes 


* Software that runs on a hybrid CPU/GPU cluster;

* Partitioned transaction graphs of the current blockchains that are easy to
manage;

* Blacklist of hacked/fraudulent account addresses on Bitcoin and Ethereum
that is collected from public sources on the Internet.


Figure 3.57 shows the Pilot #9 workflow. The input data are the Ethereum and
Bitcoin blockchains and their nodes. The inputs are fed into the blockchain
transaction graph analysis, which ultimately serves bank or business queries.

Services are to be implemented according to the user stories.

The PoC currently offers parallel, scalable blockchain transaction graph
construction and parallel graph traversals that trace customer addresses to
blacklisted addresses and return the traced subgraph. A parallel PageRank algorithm
that finds important addresses is also offered as a service. The transaction graph
can also be partitioned in parallel using the Metis software. These services are
offered on the whole dataset graph, which has 633M transactions.
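The tracing service described above can be illustrated with a small, sequential sketch. The pilot's implementation is parallel and distributed (C/C++ with MPI); the function below is only a single-process breadth-first search, and its name, the edge representation as `(sender, receiver)` pairs and the choice to stop at the first blacklisted address on each path are assumptions made for illustration.

```python
from collections import defaultdict, deque


def trace_to_blacklist(edges, customer, blacklist):
    """Walk transfers backwards from `customer`; return edges leading from blacklisted addresses."""
    incoming = defaultdict(list)
    for sender, receiver in edges:
        incoming[receiver].append(sender)

    parent = {customer: None}  # BFS tree: address -> address it was reached from
    queue = deque([customer])
    hits = []
    while queue:
        node = queue.popleft()
        if node in blacklist and node != customer:
            hits.append(node)
            continue  # do not expand beyond a blacklisted address
        for sender in incoming[node]:
            if sender not in parent:
                parent[sender] = node
                queue.append(sender)

    # Rebuild the traced subgraph as the set of (sender, receiver) edges on the paths.
    subgraph = set()
    for hit in hits:
        node = hit
        while parent[node] is not None:
            subgraph.add((node, parent[node]))
            node = parent[node]
    return subgraph
```

An empty result means that no blacklisted address was found upstream of the customer address in the given edge set.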

Components implemented, interactions and deployment 


The following components of the project have been built as proof of concept:

1. Blockchain Transaction Dataset Preparation Component.

2. Scalable Transaction Graph Analysis Component.

3. User Interface for Blockchain Transaction Reports and Visualization
Component.


Component (1) parses Ethereum raw data to extract transactions, which are saved
as files. Component (2) constructs the distributed and partitioned graph on the HPC
cluster using the transaction files and performs parallel graph algorithms.
Component (3) communicates with (2) via the RabbitMQ message service, submits
queries, displays the returned results on a web page and produces graph
visualization output using the Vis.js package.


Figure 3.57. Pilot #9: Analyzing blockchain transaction graphs for fraudulent activities
workflow.


Datasets


The following data sources are used:


* Public Bitcoin Blockchain Data (BOUN): Bitcoin transfers (send transactions);

* Public Ethereum Blockchain Data (BOUN): Ether transfers (send transactions)
and ERC20 token smart contract transactions (major popular tokens including
stable coins like EURS, GUSD, USDT, TRYB, PAX, TUSD, QCAD, XAUT);

* Bitcoin and Ethereum Addresses Database (AKTIF): all Bitcoin addresses (within
block range 0-674999) and Ethereum addresses (within block range 0-10199999)
are maintained in a database with the capability to label each address with
features;

* Blacklisted Bitcoin and Ethereum blockchain addresses, obtained from
the Internet by manual search for published hacked/fraudulent accounts and
addresses involved in ransomware activities.


Data Produced 
The following data are generated:


* Extracted Ethereum and major ERC20 token transaction data, which is also
made available at https://zenodo.org/record/4718440#.YXkLhtZBwl1. It
can be downloaded by researchers and businesses;

* Paths and subgraphs that show tracing of blockchain addresses to blacklisted
addresses;

* Importance values of addresses, computed by running the parallel PageRank
algorithm. This rank data can be used as an important feature
in the machine learning algorithms.
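To show what the importance values represent, the following is a minimal sequential PageRank by power iteration. It is a sketch, not the pilot's parallel implementation; the damping factor of 0.85, the fixed number of iterations and the uniform redistribution of rank from addresses without outgoing transfers are conventional choices assumed here.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Return a rank per address for a list of (sender, receiver) transfers."""
    nodes = sorted({address for edge in edges for address in edge})
    out_links = {node: [] for node in nodes}
    for sender, receiver in edges:
        out_links[sender].append(receiver)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by addresses with no outgoing transfers is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for sender in nodes:
            for receiver in out_links[sender]:
                new_rank[receiver] += damping * rank[sender] / len(out_links[sender])
        rank = new_rank
    return rank
```

The ranks sum to one, so the value of an address can be used directly as a numeric feature alongside other per-address attributes.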


Explainable Workflow 


Pilot #9's Blockchain Transaction Dataset Preparation Component parses raw
blockchain data and extracts Bitcoin, Ethereum and major ERC20 token transactions
(such as Gemini USD (GUSD), Tether USD (USDT), Tether Gold (XAUT),
Stasis Euro (EURS) and Turkish BiLira (TRYB)) that come from the Bitcoin and
Ethereum Mainnet blockchains. After all the blocks up to the present have been
retrieved, this component is run periodically to retrieve the blockchain blocks
newly generated during each period.

The Scalable Transaction Graph Analysis Component of the pilot takes the full
Bitcoin and Ethereum public transaction dataset. Graph traversal algorithms are used
to analyze the data, and parallel graph traversals are used to extract features in
the form of subgraphs. Since the transaction graph is massive and growing
dynamically, the component constructs a distributed and partitioned transaction
graph in parallel using MPI message passing libraries in order to achieve
scalability. The graph analysis service is accessed through a message queue that
takes commands in YAML format. The outputs of the service are graph paths or
subgraphs that show the tracing of blockchain addresses to blacklisted addresses.
In the second period of the project, the development of machine learning algorithms
has started. Bitcoin and Ethereum transaction data, blacklisted address lists and
PageRank values computed in parallel are used in the machine learning algorithms.

Finally, the User Interface for Blockchain Transaction Reports and Visualization
functional service interacts with the Scalable Transaction Graph Analysis and
presents results in a web browser. When subgraphs are returned that trace customer
addresses to blacklisted addresses, these subgraphs are output in the Vis.js graph
visualization software format for viewing in browsers. The business service is
provided through a RabbitMQ message queue that takes commands in the YAML format.
Visualization of transaction graph traces is provided, as well as a simple address
score based on the shortest path from blacklisted addresses.
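A simple address score of the kind mentioned above could be derived as in the following sketch. The exact scoring formula used by the pilot is not stated, so both the multi-source breadth-first search along the direction of the transfers and the mapping `1 / (1 + hops)` are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


def hops_from_blacklist(edges, blacklist):
    """Shortest number of transfers from any blacklisted address to each reachable address."""
    outgoing = defaultdict(list)
    for sender, receiver in edges:
        outgoing[sender].append(receiver)

    distance = {address: 0 for address in blacklist}
    queue = deque(blacklist)  # multi-source BFS: all blacklisted addresses start at 0
    while queue:
        node = queue.popleft()
        for receiver in outgoing[node]:
            if receiver not in distance:
                distance[receiver] = distance[node] + 1
                queue.append(receiver)
    return distance


def address_score(address, distance):
    """Hypothetical score: 1.0 for a blacklisted address, decaying with distance, 0.0 if unreachable."""
    if address not in distance:
        return 0.0
    return 1.0 / (1 + distance[address])
```

Computing the distances once for the whole graph and then scoring individual addresses by lookup keeps each query cheap, which suits an interactive web interface.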


Logical Schema 
The following figure illustrates how the pilot architecture can be expressed in terms 
of the layers and the pipelines approach of the INFINITECH-RA. 

Platform for data gathering (Related Reference Architecture Layers (Figure 
3.58): Blockchain, Infrastructure and Data management). 


Figure 3.58. Blockchain transaction graphs analysis pilot pipeline in line with the IRA.


The blockchain dataset component is implemented as scripts that retrieve blockchain
data as raw block data and parse it to extract crypto-currency and token
transactions. The sources of raw blockchain data are the Cloudflare Ethereum
Gateway, Google Bigtables and blockchain nodes.

Big Data management (Related Reference Architecture Layers (Figure 3.59):
Data Processing and Analytics). One cannot assume that massive blockchain
data will fit in one computer node. Therefore, distributed in-memory storage
on an HPC cluster is essential. Currently, the Scalable Transaction Graph
Analysis Component, which is implemented using C/C++ and MPI message passing
libraries, constructs a partitioned graph in parallel and provides big data
management and processing capability.

Statistics, analysis, AI (Related Reference Architecture Layers (Figure 3.59): User 
interface). 

In order to carry out analytics and report various statistics, two types of
approaches are to be utilized: (i) a Graph Algorithms Approach and (ii) a
Machine Learning Approach. For machine learning, K-Means, Support Vector
Machines, Naive Bayes, Logistic Regression, Random Forest and Artificial Neural
Network (Multilayer Perceptron) methods will be used by making use of the
existing Python scikit-learn and PyTorch machine learning software. The
Analytics layer in Figure 3.21 shows these functionalities.

Readiness, matureness, level of development (TRL level). 

Currently, a proof of concept (PoC) implementation of the pilot is available. As
a whole, the current level of development is at TRL3. On the other hand, the
industrially relevant environment for Pilot9 is defined to be an environment
where real-world blockchain data is used. When carrying out our tests in Pilot9,
we do use massive industrially relevant blockchain data. The eventual target
level is TRL7.

Figure 3.59. Layered architecture of the scalable blockchain transaction graph
analysis system.


Data Components 


The following components will be developed, deployed and used in the pilot: 


* Blockchain Transaction Dataset Preparation Component (Data ingestion in
RA).

* Scalable Transaction Graph Analysis Component (Data Management and
Analytics in RA).

* User Interface for Blockchain Transaction Reports and Visualization
Component (Interface and Analytics in RA). A database of Bitcoin and Ethereum
addresses as well as blacklisted addresses is also managed by this component.


Conclusions - Issues and Barriers 


The first year of Pilot9 has focused on (i) collection and parsing of massive
public blockchain data, (ii) design and development of a scalable parallel
transaction graph system, and (iii) development of a simple web interface that
queries the graph system and outputs visualizations of the subgraphs returned.
We concentrated mainly on Ethereum Mainnet blockchain data, because it was more
challenging to deal with due to smart contract support. Code needed to be
written to extract transactions from token contract calls.

Whereas the implemented parallel algorithms for graph construction, PageRank
computation, tracing and extraction of subgraphs have been tested successfully,
our parallel connected components algorithm still has an issue. It works on a
small test graph with 1M transactions, but on the whole 633M-transaction graph
it takes too long and possibly does not terminate, either due to a bug or
because the parallel algorithm as coded is inefficient due to excessive
communication. This will be fixed in the future by coding a more efficient
algorithm.

Even though massive public blockchain transaction data exists and can be
obtained easily by writing scripts, the same cannot be said for blacklisted
addresses. Publicly available blacklisted addresses had to be located by hand
through Google searches and extracted manually. Collection and tagging of
blacklisted address information remain challenging issues because this type of
data is often private and not publicly available.

For machine learning, we need data that can be used for training our models.
In particular, information about the licitness and illicitness of addresses is
needed, but little of it is available: only the roughly 4K blacklisted Ethereum
addresses published on various websites are available to start with. On the
other hand, there are roughly 70 million addresses on the Ethereum Mainnet.
Hence, the availability of illicit addresses is limited. Identities of the
owners of addresses are also not available. This is currently our biggest issue
and barrier. However, the fact that a parallel cluster graph analysis system
has been built means that we can run fast graph queries and traversals on
massive data. As a result, we plan to tackle these challenging issues and
barriers by developing graph algorithms that provide information about
licitness and illicitness. For example, the PageRank algorithm can be run to
find important addresses. These addresses are more likely to be licit, since
they are addresses of popular services, such as exchanges, that are regulated.
Since exchanges verify the addresses of customers and carry out KYC/AML checks,
transactions going to addresses from such services are more likely to be licit.
Hence, graph traversal algorithms can be used in this manner to report features
related to possibly licit or illicit addresses without having information about
the addresses in question. These extracted features can then be used in machine
learning algorithms.
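
As an illustration of how PageRank can surface important addresses, the sketch
below runs a plain power iteration on a tiny transaction graph. It is a
single-process toy, not the pilot's parallel implementation; the damping factor,
the iteration count and the address names are assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed graph given as (sender, receiver) pairs."""
    nodes = sorted({address for edge in edges for address in edge})
    out_links = {address: [] for address in nodes}
    for sender, receiver in edges:
        out_links[sender].append(receiver)

    count = len(nodes)
    rank = {address: 1.0 / count for address in nodes}
    for _ in range(iterations):
        # Rank held by addresses without outgoing transactions is spread uniformly.
        dangling = sum(rank[a] for a in nodes if not out_links[a])
        new_rank = {a: (1 - damping) / count + damping * dangling / count for a in nodes}
        for a in nodes:
            if out_links[a]:
                share = damping * rank[a] / len(out_links[a])
                for target in out_links[a]:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: three customers pay into one exchange address.
edges = [("a", "EXCHANGE"), ("b", "EXCHANGE"), ("c", "EXCHANGE"), ("EXCHANGE", "a")]
rank = pagerank(edges)
print(max(rank, key=rank.get))
```

The resulting rank of each address could then be stored as one numeric feature
among others for the machine learning algorithms.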


Pilot #10: Real-time Cybersecurity Analytics on Financial 
Transactions' Data 


Pilot #10 aims to significantly improve the detection of suspected fraudulent
transactions and to enable the identification of security-related anomalies
while they are occurring, through real-time analysis of the financial
transactions of a home and mobile banking system (Figure 3.60). The ability to
detect anomalies faster (i.e. in real time) and to unveil potential hidden
patterns of cyber-attacks are among the main innovations of the pilot.

Figure 3.60. Pilot #10 testbed logic.

The use case envisages pre-processing of transaction data and model training in
a batch layer (to periodically retrain the predictive model with new data),
while real-time fraud detection on new incoming transaction data is handled in
a stream layer.

A fraud detection system is proposed to meet two goals: 


* The early detection of new and subtle types of fraud. Since fraudsters keep
devising novel ways to scam people and online systems, it becomes crucial
to apply AI/ML methods that detect outliers in large transactional datasets
and are robust to changing patterns.

* The reduction of the number of false positives, which usually have to be
analyzed to understand whether they are real fraud attempts. To this aim, it
is very important to be able to train, validate and test ML models so that
the most accurate ones are made operational.


Testbed 


With regard to the design and execution of Pilot #10, the testbed definition
(that is, the setting of hardware resources such as storage, compute and
network) considers the deployment of an instance of the ALIDA asset
(Figure 3.61).

The set of resources needed is described in the following picture: 

The infrastructure setup consists of one master node and four worker nodes
running as an on-premise Kubernetes cluster. Each node has a wide enough set of
allocatable resources to run the testbed safely, without running into disk
pressure or memory pressure issues for the expected workload. In any case, the
system can be scaled both horizontally and vertically. The machines are
equipped with 128 GB of RAM, 2 TB of storage and one octa-core 3.7 GHz
processor.

ALIDA is cloud-native software, which means that it can be seamlessly deployed
both in on-premise environments and in cloud environments provisioned by widely
known providers such as Microsoft Azure, Amazon AWS and Google Cloud Platform.

Figure 3.61. Pilot #10 PoC (October 2020).

To summarize, the software requirements to get ALIDA up and running are:


* Helm and Tiller 2.14+.

* Kubernetes 1.14+.

* To enable ingresses, a valid ingress provider is required; Traefik is
recommended.

* A DNS service provider is recommended in order to use ingresses with Traefik.

* Persistent volume provisioner support in the underlying infrastructure.


Implementation of a first Proof of Concept 


The current implementation status of Pilot #10 is shown in Figure 3.62.

PI created synthetic and realistic data sets of "Bank Transfer SEPA"
transactions that are consistent with the real data present in the data
operations environment. These data sets are going to be used by Pilot #10 and,
more concretely, for the first PoC. To develop the services and workflows, an
ALIDA instance was deployed on ENG premises. As a preliminary step, a job that
transfers the synthetic data set of "Bank Transfer SEPA" transactions from an
SFTP server to the ALIDA HDFS was designed, and it is up and running
(Figure 3.63).

With the data ready to be processed, a first batch processing workflow has been
created using ALIDA. This workflow converts qualitative fields into
quantitative ones, trains a KMeans model and performs the clustering.
Figure 3.11 shows the developed ALIDA workflow, which is based on three steps
(string indexing, training of a KMeans model on the data, and creation of the
clusters).
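
The three steps of this workflow can be sketched in a few lines of plain
Python. This is only an illustration of the idea and not the ALIDA services
themselves: the field names, the sample values and the helper functions
`string_indexer` and `kmeans` are hypothetical, and no feature scaling is
applied, so the amount dominates the distance.

```python
import random


def string_indexer(values):
    """Map each distinct category to an index, most frequent category first."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {value: index for index, value in enumerate(ordered)}
    return [float(mapping[v]) for v in values], mapping


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Step 1: string indexing of a qualitative field (hypothetical "channel" column).
channels = ["web", "web", "web", "atm", "atm"]
amounts = [10.0, 12.0, 11.0, 500.0, 520.0]
channel_index, mapping = string_indexer(channels)

# Steps 2 and 3: train KMeans and assign every transaction to a cluster.
points = list(zip(channel_index, amounts))
labels, centroids = kmeans(points, k=2)
print(mapping, labels)
```

The cluster labels produced in the last step are what a domain expert would
then inspect and mark as suspicious or not.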

After that, the data is grouped and visualized by cluster (Figure 3.63). Here a
domain expert has to label which clusters are suspected of fraud. After that,
the stream processing would start labelling and detecting new incoming data in
real time, but this part is not implemented yet.

Figure 3.63. Pilot #10 clustering results.


Expected Outcomes 


* Validation of specific systems, models and tools for the real-time analysis of
big data.

* Collection of quantitative evidence on the performance of the solution and
indications of potential further improvements.

* Validation of a set of cyber-risk rating metrics.


Figure 3.64 shows the dashboard and the workflow of Pilot #10: Real-time
cybersecurity analytics on financial transactions’ data.

Figure 3.64. Pilot #10: real-time cybersecurity analytics on financial transactions’ data.


Datasets 


* Synthetic Financial flow Dataset. 
* Logs for Correlation and Security Analytics. 


The data sets input to the batch workflow are related to several types of
transactions, including Bank Transfer SEPA. The Single Euro Payments Area
(SEPA) is a payment-integration initiative of the European Union for the
simplification of bank transfers denominated in euro. SEPA covers predominantly
normal bank transfers.

A data generator, implemented by ENG, will simulate real-time SEPA
transactions, which include information about the emission date, the
beneficiary and orderer accounts, the amount, the IP address of the orderer’s
connection and its location (further information is given in the Explainable
Workflow section). These data will be collected and stored in a dataset to
later retrain machine learning models batch-wise; at the same time, they are
analyzed in real time for fraud detection with previously trained models.


Data Produced


The data produced consist of a list of suspected fraudulent transactions, each
with an associated probability of actually being a fraud, estimated by the
models used.


Explainable Workflow 


To meet the above-mentioned goals, Pilot #10 envisages two layers (batch and
stream layers) implementing the following ML pipelines:


* Unsupervised training (batch) of an outlier detection model (Isolation
Forest) on all the collected data.

* Supervised training (batch) of a classifier on the data labeled by the domain
expert user.

* Real-time detection (stream) of outliers, which consists of both data
preparation services and the application of the Isolation Forest model.

* Real-time detection (stream) of fraudulent transactions using the model
trained in a supervised manner.

For the first training, the goal is to identify outliers in the data collected
over time using a method called Isolation Forest, an unsupervised technique
that identifies anomalies by isolating points in an n-dimensional space using
binary trees. These points are not necessarily fraudulent transactions but,
assuming that the illegitimate ones are a very small percentage of the whole
dataset, it is likely that they are outliers as well; therefore, it is
important to collect outliers and make them available to the domain expert for
further analysis. While analysing them, the expert will also label the data,
distinguishing between true positive and false positive fraudulent
transactions.
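
To make the isolation idea concrete, the sketch below builds random binary
trees and scores each point by how quickly it gets isolated, following the
usual Isolation Forest formulation (score 2^(-E[h]/c(n)), where E[h] is the
mean path length and c(n) a normalising constant). It is a simplified,
single-machine illustration and not the pilot's service; the function names,
parameters and sample data are assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    low = min(p[dim] for p in points)
    high = max(p[dim] for p in points)
    if low == high:  # simplification: stop when the chosen feature cannot be split
        return {"size": len(points)}
    threshold = rng.uniform(low, high)
    return {
        "dim": dim,
        "threshold": threshold,
        "left": build_tree([p for p in points if p[dim] < threshold], depth + 1, max_depth, rng),
        "right": build_tree([p for p in points if p[dim] >= threshold], depth + 1, max_depth, rng),
    }


def path_length(point, node, depth=0):
    if "size" in node:  # leaf: add the expected depth of the unbuilt subtree
        return depth + average_path_length(node["size"])
    side = "left" if point[node["dim"]] < node["threshold"] else "right"
    return path_length(point, node[side], depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """One score in (0, 1] per point; values close to 1 indicate likely outliers."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))  # requires at least 2 points
    max_depth = math.ceil(math.log2(sample_size))
    forest = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
              for _ in range(n_trees)]
    norm = average_path_length(sample_size)
    scores = []
    for p in points:
        mean_depth = sum(path_length(p, tree) for tree in forest) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores


# Hypothetical (amount, transfers per day) features: 50 ordinary rows, one extreme row.
data = [(100.0 + i % 7, 1.0 + (i % 5) * 0.1) for i in range(50)] + [(10000.0, 50.0)]
scores = anomaly_scores(data)
print(scores.index(max(scores)))
```

Points with the highest scores would be the ones handed to the domain expert
for labelling.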

The second training consists of the generation of a supervised classifier
model. Since the domain expert labels a portion of the data in real time, and
those labels are collected in batches, we can exploit the work done so far to
offer a second estimate of the probability that a transaction is illegal. This
second estimate helps the system filter which transactions the domain expert
must analyze and which ones can be ignored, reducing the expert's workload
while trying not to reduce the reliability of the fraud detection mechanism.

At the same time, a real-time analysis is needed. Before the real-time
detection, the data pass through a process of cleaning and filtering in order
to create new features that will be more useful in the predictive model, or to
enhance other features, improving model performance. We assume that datasets
made of mixed-type data, where numeric and nominal features coexist, will need
to be analyzed. These data must then be processed: e.g., instead of dates, time
intervals might be more interesting; instead of user names or IP addresses,
their location might be more useful during the model’s training.
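
The kind of feature preparation described above (time intervals instead of
dates, locations instead of IP addresses) might look like the following sketch.
The record layout, the field names and the `IP_COUNTRY` lookup table are
hypothetical; the IP addresses are taken from documentation-only ranges.

```python
from datetime import datetime

# Hypothetical lookup table; a real deployment would rely on a geolocation source.
IP_COUNTRY = {"203.0.113.7": "IT", "198.51.100.9": "FR"}


def prepare_features(transactions):
    """Replace raw dates and IP addresses with derived features."""
    last_seen = {}
    prepared = []
    for tx in sorted(transactions, key=lambda t: t["emitted_at"]):
        previous = last_seen.get(tx["orderer"])
        # Seconds since the same orderer's previous transaction (None for the first one).
        interval = None if previous is None else (tx["emitted_at"] - previous).total_seconds()
        prepared.append({
            "orderer": tx["orderer"],
            "amount": tx["amount"],
            "seconds_since_previous": interval,
            "country": IP_COUNTRY.get(tx["ip"], "unknown"),
        })
        last_seen[tx["orderer"]] = tx["emitted_at"]
    return prepared


transactions = [
    {"orderer": "acc-1", "amount": 50.0, "ip": "203.0.113.7",
     "emitted_at": datetime(2021, 3, 1, 10, 0, 0)},
    {"orderer": "acc-1", "amount": 900.0, "ip": "198.51.100.9",
     "emitted_at": datetime(2021, 3, 1, 10, 0, 30)},
]
features = prepare_features(transactions)
print(features[1]["seconds_since_previous"], features[1]["country"])
```

Two transfers from the same account only 30 seconds apart and from different
countries are exactly the kind of pattern that raw dates and IP strings would
hide from a model.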
The real-time detection of outliers consists of the application of the
unsupervised model described before; at the same time, the supervised model is
used for inference, producing a second estimate of the probability that the
transaction's data belong to a fraud. The results of both predictions are
analyzed by the domain expert, who can take action and block the transaction in
time, while all the data produced are collected, including the labels
generated. These data are used in the next cycle of batch training, improving
model performance over time.

For the aims of Pilot #10, ALIDA (https://home.alidalab.it/) is adopted and
extended to design batch and stream workflows of Big Data Analytics (BDA)
services. In a nutshell, ALIDA is a micro-service based platform, developed by
ENG (Engineering), for the composition, deployment, execution and monitoring of
workflows of BDA services; it is entirely developed with open source
technologies.

ALIDA offers a catalogue of BDA services (for ingestion, preparation, analysis
and visualization), implemented as Spring Boot applications and deployable as
Docker images. Users design their own (stream/batch) workflow by choosing BDA
services from the catalogue, indicate which Big Data set they want to process,
launch and monitor the execution of the workflow, and personalize the
visualization of results by choosing from a set of available graphs. All this
is possible without software development skills or particular knowledge of big
data technologies.

Some BDA services for preparation and machine learning, such as KMeans and
Random Forest modelling and prediction, are already available within the ALIDA
catalogue, even though they need to be reviewed (and in some cases redesigned)
to meet specific pilot requirements.

Concerning the remaining BDA services (especially the pseudonymization one),
the pilot will make use of the services made available within the project.


Preliminary step: 


Data sets related to several types of transactions (SEPA bank transfers,
foreign bank transfers, internal transfers of funds, PCTU, SMWCA, STFTS) are
loaded into the HDFS storage of the ALIDA instance by means of an ingestion job.


Batch processing, building and labelling clusters (training): 


Stored data sets are filtered (to remove columns and rows that are not needed
for the ML) and joined to obtain a single unlabeled data set to be used for the
unsupervised machine learning.

In this phase the goal is to cluster such data in order to create labeled
samples to feed the supervised machine learning classifier of the next phase.
The clustering process groups the data according to automatically detected
similarities. These clusters still need a domain expert, PI (Poste Italiane),
who determines which clusters present fraudulent behaviour and assigns labels
to such clusters accordingly.


Stream processing: 


After learning the mapping, the Random Forest (RF) classifier can map new
real-time unlabeled transaction data to the corresponding high-level
information (i.e. the label) on the basis of the model trained in the batch
layer. In this way, financial fraud events can be detected while they are
happening.


Logical Schema 


An initial mapping of the pilot's components and modules to the INFINITECH-RA
pipelines approach is illustrated in the following figure.

Figure 3.65 shows a logical view of the components identified for Pilot #10
according to the mapping with the INFINITECH Reference Architecture. The list
of main components to be deployed and used in Pilot #10 includes:


* Identity Management System: It is a cross cutting system that guarantees user 
authentication, authorization and management. It allows or denies the access 
to the federated services that run within the architecture. 

* Role Management: It implements and handles the roles and the privileges 
that can be associated to the users. It is often tightly coupled with the Identity 
Management System. 

* Message Broker: Intermediary software that allows system components to
communicate with each other effectively, implementing a common communication
protocol over message buses.

* Resource Manager: Lower-level software that makes it possible to handle and
use the infrastructure resources seamlessly and dynamically, according to the
number of requests received per time interval.

* Pseudoanonymizer: Tool to pseudonymize personal or sensitive data at source,
in order to preserve privacy in accordance with the GDPR.

* Filter: Filtering component to remove specific rows and columns.

* Join: Service to join two or more datasets where at least one column must be
the same.

* OneHotEncoder: Service to transform categorical variables into numerical
ones.

* Clustering (KMeans): Given a set of observations (x1, x2, ..., xn), where each
observation is a d-dimensional real vector, k-means clustering aims to partition
the n observations into k (≤ n) sets S = {S1, S2, ..., Sk} so as to minimize the
within-cluster variance.

* Random Forest: An ensemble learning method for classification, regression
and other tasks that operates by constructing a multitude of decision trees at
training time and outputting the class that is the mode of the classes
(classification) or the mean prediction (regression) of the individual trees.
Random decision forests correct for decision trees’ habit of overfitting to
their training set.

* Results: Storage that contains all the processed data produced by the
workflow.

* Visualization: The service that gathers the resulting datasets to be delivered
to the visualization clients.
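
For reference, the within-cluster variance minimized by the Clustering (KMeans)
component listed above is usually written as follows, where the centroid symbol
\mu_i is introduced here for illustration and denotes the mean of the
observations assigned to set S_i:

```latex
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
```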


Components 


The list of main components to be deployed and used in the pilot includes: 


* Filter: Filtering component to remove specific rows and columns.

* Join: Service to join two or more datasets where at least one column must be
the same.

* Pre-elaboration: Service to transform categorical variables into numerical
ones through different calculations.

* Outlier detection (Isolation Forest): Given a set of observations (x1, x2,
..., xn), where each observation is a d-dimensional real vector, isolation
forest associates to each observation a value that expresses how much it
differs from the distribution calculated on each dimension.

* Fraudulent transaction detection: Exploiting a supervised classifier
algorithm (e.g., random forest classifier, neural network classifier), this
component classifies incoming data into two categories: suspected frauds or
clean transactions.

* Results: Storage that contains all the processed data produced by the
workflow, published to be visualized.

* Visualization: The service that gets the resulting datasets to be delivered
to the visualization clients.


Conclusions - Issues and Barriers 


The current setup clearly demonstrates some of the most relevant capabilities
of the pilot:


* availability of a significant dataset for analysis
* data collection from source and preparation
* data ingestion
* AI model training
* prediction
* data visualization


Some pre-processing services are expected to be needed in both the batch and
stream stages before the ML algorithms are invoked. They will be implemented
once the data schema of the transactions has been defined.

In order to fulfil GDPR requirements both at the pilot stage and in a potential
production stage with real transaction data, even the fully synthetic dataset
will be pseudonymized at source. Currently, synthetic data are pseudonymized at
generation time, so data analysis works on pseudonymized data; we expect that a
pseudonymization tool will be made available in the framework of the
INFINITECH project for potential production use with real data.


Expected Business Impact of Technologies adopted in this Pilot 


Fraud in financial services is an ever-increasing phenomenon and cybercrime
generates multi-million revenues; therefore, even a small improvement in fraud
detection rates would generate significant savings.

This viewpoint, built on information sharing activities currently running in
the banking sector, is reinforced by trusted industry reports. Some surveys and
reports point to issues such as: "recover less than 25 percent of fraud
losses", "Increase fraud typologies globally, from recent years, include
identity theft and account takeover, cyber-attack, card not present fraud and
authorized push payments scams", "6 is the average number of frauds reported
per company studied", "56% asked companies investigated their worst fraud
incident. many organisations are failing to respond effectively". These and
other issues in these reports demonstrate the importance of developing new
technologies and approaches, such as real-time analytics, to strengthen the
fight against cyber fraud.


Pilot #16: Data Analytics Platform to Detect Payments 
Anomalies Linked to Money Laundering Events 


Nexi, as the Italian paytech leader, owns and manages a large big data
ecosystem, which includes information regarding cardholders, merchants,
organizations, and digital payment authorizations and transactions. The pilot
will build a data analytics platform to help the Nexi AML team discover,
monitor and analyze suspicious scenarios related to money laundering through
digital card payments.

The purpose of the pilot is to monitor anomalous scenarios linked to money
laundering, adhering to European AML regulatory compliance policies, by
notifying detected cases to the Italian Financial Intelligence Unit (FIU). The
innovation potential of the pilot lies in introducing novel technologies such
as machine learning, artificial intelligence and graph databases, which allow
complex anomalous money-laundering scenarios to be detected automatically.

The adoption of the pilot platform will improve the quality and efficiency of
AML users' work and, at the same time, will help reduce the risk of undetected
scenarios related to money laundering events.


Data Sources 


The following data sources will be integrated and used in the pilot: 


* Cardholders transaction operations.

* Cardholders information registry.

* Merchants transaction operations.

* Merchants information registry.

* AML Anomalies Features Store.

* AML Suspicious Activities Report (SAR) practices collection.

* Master and reference data.


All the above-mentioned data sources are in an anonymized format and are
collected and stored in a Data Lake environment to enable agile development and
processing.

The pilot will use a graph database to model the many-to-many relations that
belong to anomalous events linked to money laundering; thanks to this
technology, we can find any relationship between suspicious payment events and
individuals or merchants.


Data Produced 


The three data outputs produced during the pilot are: 


* Anomalous subjects. 
* Cluster of anomalous subjects. 
* Anomaly risk score for each subject.


Decision rules developed in the graph database, based on many-to-many
associations, will periodically (monthly or quarterly, as appropriate) produce
the anomalous subjects (1) or the groups (clusters) of anomalous subjects (2)
that have been intercepted.

During the pilot we will develop an algorithm that updates the anomaly risk
scores (created with a supervised machine learning classification algorithm)
associated with each subject: these updated scores are the last data product
generated (3).

Subjects can be cardholders, legal entities, organizations or legal
representatives. Any sensitive information, such as a person ID, is anonymized
so that it cannot be traced back to a physical or legal person.

The output format will be a CSV text file or a Parquet file stored in the Data
Lake. These formats avoid being bound to any particular database technology. In
a post-processing phase, we can then choose how to use and model the data: a
relational database to perform SQL queries and to serve as the back end of a
data visualization tool, or just a plain .csv file to perform exploratory data
analysis with data science frameworks.


Explainable Workflow 


The data workflow considers all the steps and data shapes needed to develop an
ML solution to find anomalous scenarios: the collection of data (listed in the
previous paragraph), the processing step to create training, validation and
test sets, the training of the risk score with ML algorithms and the
presentation layers to communicate results.

As a first step, we collect into a Data Lake, in anonymized format, historical
data about the behaviour of Nexi clients, such as transaction payments,
withdrawals, merchant information, reversals, money transfers and past SARs
reported to the Italian Financial Intelligence Unit (FIU).

Afterwards, we apply the Transformation step of a typical Extract, Load,
Transform (ELT) workflow to create the Feature Store, that is, a dataset
containing machine learning features for each cardholder (or organization)
together with the target variable of the ML algorithm, which in this case is a
binary variable representing whether a cardholder has been reported to the FIU.
The Feature Store is updated monthly in batch mode.

Once the Feature Store is ready, we follow these steps: 


* Train a machine learning model and, based on the predictions generated,
obtain the anomaly risk score for each cardholder (or organization).

* Create a graph database, inserting the behaviour data of cardholders,
organizations and legal representatives (from both the Data Lake and the
Feature Store) and the ML-based risk score.

* Run a Personalized PageRank algorithm to adjust the risk scores, taking into
account the many-to-many relationships modelled with graph data structures.

* Define rule-based anomaly event detection to find customers or groups of
customers (clusters) to whom AML users should pay attention.
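
The Personalized PageRank step can be sketched as follows. This is one possible
reading of the step, not the pilot's actual algorithm: it assumes that the
ML-based risk scores are used as the teleport (personalization) distribution and
that relationships are undirected. The subject names and scores are
hypothetical.

```python
def personalized_pagerank(edges, risk, damping=0.85, iterations=100):
    """Propagate risk along relationships; `risk` must contain every subject in `edges`."""
    nodes = sorted(risk)
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    total = sum(risk.values())
    teleport = {n: risk[n] / total for n in nodes}  # restart distribution from ML scores
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            if neighbours[n]:
                share = damping * rank[n] / len(neighbours[n])
                for m in neighbours[n]:
                    new_rank[m] += share
            else:
                # Isolated subject: its rank goes back to the restart distribution.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * teleport[m]
        rank = new_rank
    return rank


# Two merchants with the same ML score; only one is linked to a high-risk cardholder.
risk = {"card_1": 0.9, "card_2": 0.1, "merchant_A": 0.1, "merchant_B": 0.1}
edges = [("card_1", "merchant_A"), ("card_2", "merchant_B")]
adjusted = personalized_pagerank(edges, risk)
print(adjusted["merchant_A"] > adjusted["merchant_B"])
```

In this toy example the merchant connected to the high-risk cardholder ends up
with a higher adjusted score than an otherwise identical merchant, which is the
effect the many-to-many relationships are meant to capture.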


The outputs of all the steps mentioned above are then stored in the Data Lake
in a text file format (.csv) or a compressed file format such as Parquet, and
then modelled into relational database tables or views to make them accessible
to analytics users.

Finally, a visualization dashboard allows end users (Customer Due Diligence
team members) to explore and visualize the outputs of the data workflow, so
that they can discover and analyse the riskier subjects suggested by the
algorithms developed (Figure 3.66).

Figure 3.66. Pilot #16 pipeline in line with the IRA.


Logical Schema 


The mapping of the explainable workflow of the pilot to the INFINITECH-RA
layers is shown in the following figure.


Components 
The following main components will be deployed and used in the pilot pipelines: 


* Big Data Management Layer to collect and process data (Data Management
in RA).

* Money Laundering Risk Prediction supervised classification model and graph
database engine to adjust risk scores (Analytics and Machine Learning in RA).

* Visualization dashboard of customers with a higher risk of money laundering
events (Visualization in RA).


3.5 Smart, Reliable and Accurate Risk and Scoring 
Assessment 


Pilot #1: Invoices Processing Platform for a More Sustainable 
Banking Industry 


The main objective of the pilot is to develop, integrate and deploy a
data-intensive system to extract information from notary invoices, in order to:
(i) establish the sustainability index of each notary based on the number of
physical copies that are issued; (ii) provide financial institutions with
properly indexed information about the documents that are finally generated by
the notarial services required by the bank; (iii) promote the notarial services
of those notaries with the higher sustainability scores.

The innovation of the pilot lies in the application of Artificial Intelligence
technologies to scanned physical documents (notary invoices) for cost savings
and increased effectiveness. Currently, many physical documents and copies
(some of them redundant) have to be managed. Each physical copy and its control
cause significant costs over the lifetime of the financial product. AI can be
leveraged to extract relevant indicators from digitized invoices, which in turn
can be used to automatically and accurately rate notaries based on a
sustainability index.

The following is a list of the partners participating in the pilot and their
different roles/contributions:


* Bankia’s Auditing department provides the business use case, the functional
requirements, and the expert knowledge about entities to extract, business
rules, alert generation and information dashboard. It also provides the cloud
environment for the deployment of the storage and computation platform. It
provides the expert knowledge for invoice tagging and validation of the final
product.

* GFT provides the architectural design, platform implementation, algorithm
design, training, validation and implementation, together with the document
pre-processing. GFT carries out the development of the different components.

* Final users are Bankia’s internal auditing department, where a real task will
be automated, thus obtaining a real return on investment and key performance
indicators.

* The following partners will also collaborate in the pilot: FBK (Fondazione
Bruno Kessler) as the data science expert, RB (ReportBrain) as solution expert
with expertise in text analytics and sentiment analysis, and INSO (Insomnia
Digital Innovation Hub) as business and development advisor.


Owing to its modular nature and the close interaction between technology and
business actors, the solution has been showcased to different potential customers
in Europe and the US, who have shown a keen interest. Their feedback is that, with
adaptations to their business workflows and retraining on their documents, both of
which are entirely feasible, this solution will constitute a compelling technology
and business case.

The technology developments have an impact on the implicit training of different
actors:


* Business users, who have been trained in the statistical nature of the results
produced by such tools, and therefore in the interaction and cooperation
between humans and AI tools.

* Business managers, who learn to assess the impact and specificity of the use of
such technologies.

* Data scientists, software architects and developers, who have been trained
during the development of the present project.


Technological components and Services 


The main technological components that will be implemented and integrated as 
part of this pilot are: 


* Invoices and invoicing workflow database. 

* Document ingestion. 

* Document pre-processing: document pagination, PDF to image conversion, 
image normalization, OCR. 

* Document entities and region-of-interest extraction: machine learning mod- 
els and Natural Language Processing extractors for the identification and 
extraction of entities of interest: billable concepts, prices, headers, addresses, 
etc. 

* Entity association: graph deep neural networks for the identification of
related concepts, e.g. that a certain billable concept corresponds to an
identified price and an identified quantity.

* Business rules engine: application of compliance business rules for the gen- 
eration of alerts and reports. 

* Data Tagger: for the tagging of training invoice examples.

* Document validator: for the verification of processed invoices. 

* Training and inference orchestrated pipelines. 

* MLOps tools: Models and data Repository, code repository. 


Testbed 
A cloud-based testbed will be implemented using the AWS Bankia Private Cloud, with
an estimated data volume of 2 TB. The testbed is already available, and the
hardware to be used will depend on the task to be accomplished:


* For training and tagging: an AWS EC2 instance of type g4dn.xlarge (with GPU)
and 200 GB of disk.

* For inference: regular compute-optimized c6g.2xlarge instances, or the
same g4dn.xlarge type, with the AWS Deep Learning AMI (Ubuntu 18.04).


Some more details about different tools and software components to be deployed 
follow: 


* Data management: Linux file system, S3, Elasticsearch.
* Data processing: Kafka, Kubeflow.
* Data analytics and AI-related tools: TensorFlow 1.5, scikit-learn, pandas, NumPy,
seaborn.

* Data tagging: labelme.

* Data visualization: Kibana, Fluentd, Prometheus.


Implementation of a first Proof of Concept 


This first Proof of Concept targets the implementation of a two-sided
machine-learning-based system at scale.

On one side, it automates the capture and extraction of the unstructured
information in scanned documents using computer vision and deep neural
networks. This entails developing, integrating and deploying a data-intensive
system that extracts information from notary invoices to establish a sustainability
index of notary services, based on the number of physical copies issued, that will
be used by the bank.

On the other side, it captures the business rules expressed as concept
associations (e.g. invoiceable concept + related quantity + related price) that are
scattered across the document in an unstructured way and with high variability.
Finally, the whole process is automated at scale and coupled with automated
workflows for document capture and reporting in a real financial institution
environment.
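
As a rough illustration of the second side, the following minimal sketch applies a compliance rule to concept associations that have already been extracted. It is not the pilot's rules engine: the `LineItem` schema, the `check_line_item` function, the tolerance and the rule itself (quantity times unit price must match the line total) are hypothetical and chosen only to show the shape of such a check.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineItem:
    """One associated triple from an invoice (hypothetical schema)."""
    concept: str
    quantity: int
    unit_price: float
    line_total: float


def check_line_item(item: LineItem, tolerance: float = 0.01) -> List[str]:
    """Return alert messages for one associated line item (hypothetical rules)."""
    alerts = []
    if item.quantity <= 0:
        alerts.append(f"{item.concept}: non-positive quantity {item.quantity}")
    expected = item.quantity * item.unit_price
    if abs(expected - item.line_total) > tolerance:
        alerts.append(
            f"{item.concept}: total {item.line_total:.2f} "
            f"differs from quantity x price {expected:.2f}"
        )
    return alerts


# Illustrative values only; they do not come from real invoices.
items = [
    LineItem("authorized copy", 3, 12.50, 37.50),
    LineItem("simple copy", 2, 4.00, 10.00),
]
alerts = [alert for item in items for alert in check_line_item(item)]
```

In this example only the second item raises an alert, because its stated total does not match the product of quantity and unit price.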


Expected Outcomes 


In terms of technical results, the following components will be developed, inte- 
grated, deployed and operated: 


* A batch processing architecture to process notary invoices.

* A computer vision system to identify and extract tables.

* An AI/Machine Learning system to extract information from tables.
* A visual console to show results and the sustainability score.


Datasets 


* Real Invoices.
* Physical copies.


Data from 32,300 real invoice documents and from 3,000 different notaries,
extracted from Bankia systems, are the source of the pilot. Invoice documents will
be digitized in PDF format, or may arrive already digitized through other channels
(email attachments, bulk SFTP, etc.). Data type will be: PDF/Image/Text. Data
format will be: PDF/PNG/TXT. Estimated data volume will be: 2 TB. The TableBank
dataset, which consists of 500,000 documents, will be used as the benchmark
for image-based table detection and recognition.

Data Produced


Digitization of contracting and invoicing processes will allow an automated analysis
of the digitized documents, enabling a smart and autonomous scoring of notary
services. Rating notaries based on a "Sustainability Index Score" will provide a new
criterion to be applied when contracting these services, with a positive short- and
long-term impact on the amount of paper used and the fees charged.
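
The text does not specify how the Sustainability Index Score is computed. Purely as an illustration, the sketch below assumes a score that decreases with the average number of physical copies per invoice. The formula, the function names and the sample figures are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def sustainability_score(copies_per_invoice: List[int]) -> float:
    """Hypothetical score in (0, 100]: fewer physical copies give a higher score."""
    return 100.0 / (1.0 + mean(copies_per_invoice))


def rank_notaries(copies_by_notary: Dict[str, List[int]]) -> List[Tuple[str, float]]:
    """Return (notary, score) pairs sorted from most to least sustainable."""
    scores = {name: sustainability_score(c) for name, c in copies_by_notary.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical numbers of physical copies found on each notary's invoices.
copies_by_notary = {
    "notary_A": [1, 1, 2],
    "notary_B": [4, 6],
    "notary_C": [0, 0, 0],
}
ranking = rank_notaries(copies_by_notary)
```

Any monotonically decreasing function of the copy count would serve the same purpose; the actual index may also weight other indicators extracted from the invoices.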


Explainable Workflow 


Invoice documents will be securely stored in a data lake. The system will
parallelize different jobs to pre-process, process and post-process the documents and
the outcomes, for instance: image pre-processing (cropping, adjusting brightness,
contrast, etc.), PDF-to-text conversion, OCR and text correction. A computer vision
system will identify and extract tables from invoices, which makes it possible to
extract the information needed to establish a sustainability score. Machine learning
will then be used to extract information from the identified and extracted tables.
The extracted information will be displayed so that it can be validated and fed back
into the system. The AI models will be trained (offline process) with a combination
of large public datasets and specific invoice samples. Trained models will be
published to the runtime processing environment after an expert evaluation.

Figure 3.67 shows, from a high-level point of view, the interactions and workflow
between the main components. Invoices are automatically ingested by the system
to start the processing, which yields the OCRed document together with the extracted
fields, the associations between the corresponding fields, and the results of
applying the rules. The results are then stored in the Elasticsearch database and,
finally, the summary is made accessible through a dashboard (Figure 3.68).


Logical Schema 


The following figure illustrates the logical architecture of the pilot in-line with 
INFINITECH-RA constructs and approach. 


[Figure: Document Digitalization -> AI Platform (BigData Tech, NLP, Neural Networks, Computer Vision) -> User Interface]


Figure 3.67. Pilot #1 main components interactions.

Figure 3.68. Invoices processing pilot pipeline in-line with the IRA.


Components 


The main technological components that will be implemented and integrated as
part of this pilot are:


* Invoices and invoicing workflow database.

* Document ingestion.

* Document pre-processing: document pagination, PDF to image conversion,
image normalization, OCR.

* Document entities and region-of-interest extraction: machine learning
models and Natural Language Processing extractors for the identification and
extraction of entities of interest: billable concepts, prices, headers, addresses,
etc.

* Entity association: graph deep neural networks for the identification of
related concepts, e.g. that a certain billable concept corresponds to an
identified price and an identified quantity.

* Business rules engine: application of compliance business rules for the
generation of alerts and reports.

* Data Tagger: for the tagging of training invoice examples.

* Document validator: for the verification of processed invoices.

* Training and inference orchestrated pipelines.

* MLOps tools: Models and data Repository, code repository.

* Reporting business dashboards and operational databases.


Conclusions - Issues and Barriers 


End-users from the auditing department have been involved in the development
following an agile methodology. Their role has been crucial in the:


(1) Validation of the information to be extracted and the definition of the
ground truth for the document samples.
(2) Elicitation of the operation workflow and reporting dashboards.
(3) Definition of the business rules for the concepts association.


Mismatches between end-user expectations and requirements and the actual project
implementation have been addressed through the active involvement of the users in
the bi-weekly review meetings.

At present, the critical path consists of the coordination and adjustment
of the different elements of the pipeline, which is typical of projects that make
intensive use of machine learning algorithms and therefore combine many
moving parts.

The main barriers, such as the availability of data, tagged data and expert knowledge
for problem definition, have largely been removed.


Pilot #2: Real Time Risk Assessment in Investment Banking 


The pilot will implement a real-time risk assessment and monitoring procedure for
two standard risk metrics: VaR (Value-at-Risk) and ES (Expected Shortfall). Both
can be applied to measure various types of risk, above all the market risk of
portfolios of assets. The pilot will implement both risk metrics for estimating
market risk and will allow updates in (near) real time as market prices and/or the
bank's portfolio change. In addition, it will implement the evaluation of what-if
scenarios for pre-trade analysis, i.e. estimating changes in risk measures before
a new trading position is entered. Moreover, the pilot will implement a
sentiment-based decision support indicator. While VaR and ES are quantitative risk
measures based on numerical price data, the market sentiment will be derived from
financial and economic news data and social media channels.

The aim of this use case is to give traders in investment banking a precise and
timely indication of the risk of a given portfolio, and specifically of changes in
risk due to market changes or changes of the portfolio. The need for such knowledge
comes from operational as well as supervisory requirements that every regulated
financial institution must comply with.

Risk assessment is based on a common risk metric, Value at Risk (VaR), to
be calculated and updated in real time both at portfolio level and for each
individual asset. A second risk measure, Expected Shortfall (ES), which indicates
not the loss that will not be exceeded at a given confidence level but the expected
loss in the cases where that level is exceeded, will be derived at a later stage.
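
To make the difference between the two measures concrete, the following minimal sketch estimates both from a sample of losses. It assumes losses are expressed as positive numbers and uses an empirical quantile; the function name `var_es` and the sample data are illustrative and are not taken from the pilot's code.

```python
import numpy as np


def var_es(losses, confidence=0.99):
    """Empirical VaR and ES of a loss sample (losses as positive numbers)."""
    losses = np.asarray(losses, dtype=float)
    # VaR: the loss level that is not exceeded with the given confidence.
    var = np.quantile(losses, confidence)
    # ES: the average of the losses at or beyond the VaR level.
    es = losses[losses >= var].mean()
    return var, es


# Illustrative sample: losses of 1, 2, ..., 100 currency units.
losses = np.arange(1, 101, dtype=float)
var_99, es_99 = var_es(losses, confidence=0.99)
```

By construction ES is never smaller than VaR at the same confidence level, which is why it is often preferred when the shape of the tail matters.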

The pilot will support institutional traders, asset managers, risk managers and
wealth management experts in:


* Calculating the Value-at-Risk (VaR) of their portfolios. Emphasis will be placed
on FOREX (FX) portfolios, yet the system will be applicable to other types
of portfolios as well.

* Evaluating what-if scenarios for alternative portfolios based on their VaR. In
practice, the system will simulate alternative investment strategies and will
provide relevant information to the end-users to help them shape their
investment decisions.
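
A what-if evaluation of this kind can be sketched as follows. The example assumes that VaR is taken as the loss quantile of the historically simulated profit and loss, and that positions are expressed as currency exposure per asset. The function names, the synthetic returns and the candidate trade are hypothetical.

```python
import numpy as np


def historical_var(returns, positions, confidence=0.99):
    """VaR as the loss quantile of simulated P&L.

    returns:   array (n_days, n_assets) of historical returns
    positions: exposure per asset, in currency units
    """
    pnl = np.asarray(returns, dtype=float) @ np.asarray(positions, dtype=float)
    return -np.quantile(pnl, 1.0 - confidence)


def what_if(returns, positions, trade, confidence=0.99):
    """Compare VaR before and after adding a candidate trade to the portfolio."""
    positions = np.asarray(positions, dtype=float)
    candidate_positions = positions + np.asarray(trade, dtype=float)
    current = historical_var(returns, positions, confidence)
    candidate = historical_var(returns, candidate_positions, confidence)
    return current, candidate, candidate - current


# Synthetic daily returns for three FX assets (illustrative only).
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.005, size=(250, 3))
positions = np.array([1_000_000.0, 500_000.0, 0.0])
trade = np.array([0.0, 0.0, 250_000.0])  # candidate new position
current, candidate, delta = what_if(returns, positions, trade)
```

The sign of `delta` tells the trader whether the candidate position would increase or reduce the portfolio's risk before the trade is entered.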


The main innovations of the pilot lie in: 


* The calculation of VaR at very short timescales, based on the processing of
data ingested at high rates.

* The use of ML-based VaR calculation techniques that will yield
more accurate values and will help traders better understand and
frame the risks of their portfolios.


Technological components and Services 


The components to be implemented are depicted in Figure 3.69, which illustrates
the data science pipeline for the pilot. They include:


* Data ingestion component. Ensures the ingestion of real-time data into the
database of the pilot (LXS database). It is designed to cope with the high
ingestion rates of the real-time data.

* Market Sentiment component. Extracts market sentiment for specific assets
of the portfolio and provides this information to reinforce the
accuracy of the VaR calculation and/or to provide alternative methods for
VaR calculation. The component is not implemented in the early Proof of
Concept that is described in this deliverable.

* VaR calculation component(s). Scientific Computing and Machine Learning
components, which calculate the VaR of the portfolio based on different
methods (e.g., historical method, variance-covariance, Monte Carlo
simulation). They harness data from the LXS dataset.

* End-User Dashboard component. Provides user-friendly visualization of the
VaR parameters for the different portfolios owned by the user.

* Semantic Interoperability component. Provides an interface for access to
FIBO data, while supporting their parsing. The semantic annotation and
the structuring of the data according to FIBO, and hence their
description, are beyond the scope of this deliverable.

Figure 3.69. Pilot #2 data science pipeline.


The pilot pipeline conforms to the INFINITECH-RA specification i.e. 
the various blocks are structured according to the modules and layers of the 
INFINITECH-RA. Likewise, the deployment of the components adheres to the 
guidelines of the INFINITECH reference testbed. 

The deployment diagram of the pilot for the first proof of concept can
be seen in Figure 3.70. More specifically:


* The main elements of this deployment are:

o LXS Database (Docker container) containing historical ticker data.

o predict_var (Dockerized Python scripts) for time series pre-processing and
VaR prediction.

o visualize_var (Dockerized Flask web application) to visualize FOREX
asset historical statistics and VaR predictions, and to perform what-if analysis.

* New ticker data (test set) are injected from a CSV file into the predict_var
container through Kafka. In the next version the LXS DB will be used instead of a
CSV file.

* The predict_var container reads the historical data from the LXS DB once, to be
used as a training set for VaR calculation. As new ticker data arrive (from the test
set), the training set is updated in the predict_var container.

* The predicted results are written back to the LXS DB.

* visualize_var reads the predictions from the LXS DB to update the dashboards
dynamically.


Other non-technical requirements 


At later stages of the pilot, experiments with more data will be carried out, based
on access to data from other trading platforms (e.g., Forex platforms that provide
APIs for different assets). Likewise, the estimations for the open source datasets to
be used are subject to revision.


Implementation of a first Proof of Concept


The Proof-of-Concept implementation comprises the following components in- 
line with the pilot data science pipeline (Figure 3.71): 


* Data Ingestion Component implemented within the LXS database.
* Scientific computing components in Python that calculate the Value-at-Risk
(VaR) using three different methods, namely:
o The Historical Method: This is probably the simplest VaR calculation
method. It relies on significant volumes of historical market data (e.g.,
typically one trading year of data for conventional assets and much more than
that for hedge funds) to calculate the price changes for all the assets of the
portfolio. Accordingly, it calculates the value of the portfolio for each one
of the price changes, i.e. the value of the portfolio is simulated as many times
as there are price changes in the historical data (e.g., approx.
250-260 times for one trading year). These simulated/estimated values
for the portfolio can be sorted and used to form a distribution. Then the
VaR at a given confidence level (e.g., 99%) is computed as the mean of
the simulated values minus the corresponding low quantile (e.g., the 1% lowest
value for the 99% case) in the series of simulated portfolio values.

o The Variance-Covariance Method: This is also called the parametric method.
It assumes that returns follow a normal distribution, which is a simplistic
yet acceptable assumption during normal market conditions. Given this
assumption, two parameters can be computed, i.e. an expected return and
a standard deviation for the portfolio. In the case of a portfolio with many
assets, the standard deviation should consider the correlation in the price
changes of the different assets. The latter requires the computation of the
covariance matrix of the various assets (which captures the pairwise
correlations between the assets). Based on the mean and the variance of the
portfolio, its value distribution is calculated and the value at the 95% or 99%
confidence level is produced. The method works quite well when there is a large
sample size for the assets of the portfolio, as well as when the distributions
of the asset prices are known.

o The Monte Carlo Method: This method randomly generates scenarios for
the future price of the portfolio based on some non-linear pricing models.
Accordingly, it creates the distribution of these future prices and takes the
loss at the target confidence level. The method is more reliable when
dealing with complex portfolios and complicated risk factors. Its advantage
compared to the first two methods is that it is not restricted to scenarios
seen in the past, but may also consider scenarios more extreme than those
contained in the historical data due to its random component, and thus is
expected to be more realistic.


* The visualization dashboard, which displays VaR charts for each one of the
three methods and two confidence levels (95%, 99%). A snapshot of the
charts of the dashboard is depicted in Figure 3.71.

Figure 3.71. Pilot #2 Dashboard for parameter configuration and visualization.
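
The three methods can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than the pilot's implementation: VaR is measured relative to the mean of the value distribution, as in the description of the historical method above; the Monte Carlo variant draws returns from a multivariate normal distribution instead of using a non-linear pricing model; and the weights, covariance matrix and portfolio value are invented for the example.

```python
import numpy as np
from statistics import NormalDist


def historical_var(portfolio_values, confidence=0.99):
    """Mean of the simulated values minus their (1 - confidence) quantile."""
    values = np.asarray(portfolio_values, dtype=float)
    return values.mean() - np.quantile(values, 1.0 - confidence)


def parametric_var(weights, mean_returns, cov, portfolio_value, confidence=0.99):
    """Variance-covariance VaR under the normality assumption."""
    w = np.asarray(weights, dtype=float)
    mu = float(w @ np.asarray(mean_returns, dtype=float))
    sigma = float(np.sqrt(w @ np.asarray(cov, dtype=float) @ w))
    z = NormalDist().inv_cdf(confidence)
    expected_value = portfolio_value * (1.0 + mu)
    quantile_value = portfolio_value * (1.0 + mu - z * sigma)
    return expected_value - quantile_value


def monte_carlo_var(weights, mean_returns, cov, portfolio_value,
                    confidence=0.99, n_scenarios=100_000, seed=0):
    """Monte Carlo VaR with normally distributed scenarios (a simplification)."""
    rng = np.random.default_rng(seed)
    scenarios = rng.multivariate_normal(mean_returns, cov, size=n_scenarios)
    values = portfolio_value * (1.0 + scenarios @ np.asarray(weights, dtype=float))
    return historical_var(values, confidence)


# Hypothetical three-asset FX portfolio worth 1,000,000 currency units.
weights = [0.5, 0.3, 0.2]
mean_returns = [0.0, 0.0, 0.0]
cov = [[1e-4, 5e-5, 0.0],
       [5e-5, 1e-4, 0.0],
       [0.0, 0.0, 4e-4]]
var_param = parametric_var(weights, mean_returns, cov, 1_000_000.0)
var_mc = monte_carlo_var(weights, mean_returns, cov, 1_000_000.0)
```

Because the simulated scenarios here are themselves normal, the Monte Carlo and parametric estimates should agree closely; the two diverge once the scenario generator departs from normality.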


A comparative view of all three methods with different parametrizations,
and their development over time on a daily basis, is depicted in Figure 3.72.

Figure 3.72. Pilot #2 charts for three VaR calculation methods.

The Proof of Concept leverages reduced versions of the "Trade Data" and "Tick
Data" datasets, i.e.:


* The Trade Data comprise 3 popular FX assets. The scale-up of the
Proof-of-Concept will support more complex portfolios.

* The Tick Data comprise historical data about the corresponding Forex assets
in the period March 2020 to October 2020. This is considered a sufficient dataset
for the Proof-of-Concept and the validation of the various methods.
However, the scale-up of the pilot will use more data and will experiment with
different historical windows.


Expected Outcomes 


The pilot will implement a real-time risk assessment and monitoring procedure
covering two standard risk metrics and pre-trade analysis (Figure 3.73):


* VaR (Value-at-Risk).
* ES (Expected Shortfall).
* Pre-trade analysis.


Datasets 


Data will be extracted from several data sources: real-time market data,
historical market data, a synthetic electronic order platform (trades data), and
financial news/article data. The pilot will leverage FOREX (FX) data provided by
the JRC Platform and other trading platforms via Forex APIs. The data will include:


* Trade Data (i.e. data with the assets’ positions) of the user that will be used
to define the portfolio(s) of the user and their VaR/ES;

* Tick Data (i.e. historical market data) that will be used in the different
methods for VaR calculation, including standard methods such as Monte
Carlo simulation, Variance-Covariance and Historical Simulation, and a novel
one based on deep neural networks, the so-called DeepVaR.

* Alternative data (e.g., data from news feeds) that will be used for market
sentiment analysis based on NLP (Natural Language Processing) techniques.
Such data will be obtained from open APIs (e.g., Google News API, Twitter
API and Interactive Brokers API).

Figure 3.73. Pilot #2: real-time risk assessment in investment banking workflow.


Trade Data and Tick Data will contain information such as: the name of the
instrument in FOREX trading (e.g. GBPUSD for the exchange of GBP to USD), the
timestamp that denotes when the trade took place, the quantity and the closing
price.

The main data computed and produced include the VaR (Value-at-Risk) and ES
(Expected Shortfall) estimations. In addition, the injected real-time data are
saved as historical market data both as they are (ticker data) and in processed
form (i.e., aggregated market data at frequencies of 1 min, 5 min, 1 hour, 1 day).
Moreover, the pilot's sentiment-based decision support indicator, derived from
financial and economic news data and social media channels, will produce a sentiment
score (positive, neutral, negative) for each article/description coming from the news
feed.

Figure 3.74. Real-time risk assessment pilot pipeline in-line with the IRA.
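
The aggregation of ticker data into the frequencies mentioned above can be sketched with pandas as follows. This is only an illustration: in the pilot this step is performed inside the data management layer, and the column names `price` and `quantity`, the `aggregate_ticks` function and the sample ticks are assumptions made for the example.

```python
import pandas as pd


def aggregate_ticks(ticks: pd.DataFrame, freq: str) -> pd.DataFrame:
    """Aggregate ticks (timestamp index, 'price' and 'quantity' columns) into bars."""
    bars = ticks["price"].resample(freq).ohlc()      # open, high, low, close
    bars["volume"] = ticks["quantity"].resample(freq).sum()
    return bars.dropna(subset=["close"])              # drop intervals without ticks


# Illustrative GBPUSD ticks (values invented for the example).
ticks = pd.DataFrame(
    {"price": [1.30, 1.31, 1.29, 1.32], "quantity": [100, 50, 75, 25]},
    index=pd.to_datetime([
        "2020-03-02 09:00:10",
        "2020-03-02 09:00:40",
        "2020-03-02 09:03:05",
        "2020-03-02 09:07:30",
    ]),
)

# 1 min, 5 min, 1 hour (written as 60 minutes) and 1 day.
bars_by_freq = {freq: aggregate_ticks(ticks, freq)
                for freq in ("1min", "5min", "60min", "1D")}
```

Each entry of `bars_by_freq` holds one row per non-empty interval, with the closing price of the interval available for the VaR calculations.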


Explainable Workflow 


Data from the real-time market database and the news feed databases are injected
into the Data Management layer through a stream processing component that is
capable of handling large volumes of data arriving at very high ingestion rates. The
real-time data are initially concatenated with the historical data and then
transformed using a data windows component (i.e., the Online Aggregates
Component), which creates segments of time series. Data from the electronic order
platform are managed using a data extractor. These data will also serve as input for
both the correlation matrix and the scenario specifications components. The
processed market data (historical and real-time) will then feed the correlation
matrix component together with the processed data from the electronic order platform
database. The correlation matrix component processes the ingested data and merges
the different data sources. Its output, together with that of the scenario
specifications component, will then serve as input for the scenario generation,
which is the basis for the Monte Carlo simulation. The processed data will then go
into the Analytics component, where VaR/ES estimation takes place.

On the other side, data from the news article database are processed first by the
text extraction component and then by the market sentiment extraction component. In
this way, sentiment and behavioural analysis will be performed, which also serves as
input for the Analytics component (Figure 3.74).

The Analytics component will perform calculations on the data from the flows
described above and on the inputs of the configurator. The latter involves
interaction with the user in order to configure specifications for the scenario
generation.

The results are presented in the User Interface, which is responsible not only for
visualizing the VaR/ES predictions but also for performing pre-trade analysis
leveraging the developed risk assessment models.
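
The step from the correlation matrix to scenario generation can be illustrated with the sketch below. It assumes normally distributed shocks that are correlated through a Cholesky factor of the historical correlation matrix, which is one common choice and not necessarily the one used in the pilot. The function name `generate_scenarios` and the synthetic input data are hypothetical.

```python
import numpy as np


def generate_scenarios(returns, n_scenarios=10_000, seed=0):
    """Generate correlated return scenarios from historical returns.

    returns: array (n_periods, n_assets) of historical returns.
    Returns the scenarios and the historical correlation matrix.
    """
    returns = np.asarray(returns, dtype=float)
    mean = returns.mean(axis=0)
    std = returns.std(axis=0, ddof=1)
    corr = np.corrcoef(returns, rowvar=False)   # correlation matrix component
    chol = np.linalg.cholesky(corr)             # corr = chol @ chol.T
    rng = np.random.default_rng(seed)
    independent = rng.standard_normal((n_scenarios, returns.shape[1]))
    correlated = independent @ chol.T           # impose the correlations
    return mean + correlated * std, corr


# Synthetic history: the first two assets share a common driver.
rng = np.random.default_rng(1)
common = rng.standard_normal((250, 1))
historical_returns = 0.01 * np.hstack([
    common + 0.5 * rng.standard_normal((250, 1)),
    common + 0.5 * rng.standard_normal((250, 1)),
    rng.standard_normal((250, 1)),
])
scenarios, corr = generate_scenarios(historical_returns)
```

The generated scenarios reproduce the historical correlation structure and can then be revalued by the Analytics component to obtain the VaR/ES estimates.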

Logical Schema
Components


The workflow leverages the following components:


* BigData Management Layer i.e. INFINISTORE and Online Aggregates
(Data Management in RA).

* Custom Injection Simulator (Data Ingestion in RA).

* Kafka (namespace Cross Cutting in RA).

* Zookeeper (Cross Cutting in RA).

* AI model for VaR prediction (Analytics and Machine Learning in RA).

* UI Risk Assessment based on VaR (Interface in RA).

* Sentiment Analysis for financial news (Analytics and Machine Learning in
RA).


Conclusions - Issues and Barriers 


The implementation is progressing smoothly in terms of its BigData and data analytics
parts. Nevertheless, the datasets used are still quite limited. Furthermore, the
prototype of the market sentiment component is not yet available. Likewise, the NOVA
testbed is not fully operational. These are two of the main risks that have to be
monitored and cleared in the coming months, i.e. within the period M13-M18, so as to
ensure that the full-scale implementation is on track.

This pilot is enhanced with semantic interoperability features/functionalities,
which were not originally foreseen. Specifically, the pilot systems will support inputs
(e.g., Trades Data) in the FIBO (Financial Industry Business Ontology) semantic
format, to support VaR calculation for the portfolios of large investors (e.g.
large investment banks, institutional investors) that might hold assets/trades across
multiple platforms. In this case, the VaR of a portfolio might have to be calculated
based on data from multiple platforms that produce data with different semantics and
formats. FIBO will ensure the semantic integration and semantic interoperability
of these streams/trades, thereby facilitating VaR calculation for large portfolios.
The integration of a semantic interoperability module in the pilot system is
considered a highlight of the pilot.

Chapter 4


INFINITECH Conclusions 


4.1 Conclusions 


This Book Series provides an overview of FinTech services and applications
developed in particular pilot locations across Europe and beyond, i.e. Israel,
Turkey, etc. The aim is to specify different aspects of each large-scale pilot:
readiness, development, and validation of different services and components. In so
doing, validation becomes the core element in this process, as the main objective of
INFINITECH is to test innovative technologies (IoT, BigData, AI, ML, Blockchain and
more) towards improving business services in the Financial and Insurance sector. We
report on the readiness of the various pilot sites to test the innovative INFINITECH
AI, IoT and BigData technologies in the testbeds/sandboxes that are developed
during the project, while validating their ability to improve the business processes
of end-user organizations (i.e. financial organizations, banks, and FinTech firms).

In the second chapter we describe innovative technologies for the financial sector.
We explain work package 4 of the INFINITECH project in detail. Work package 4,
"Interoperable data exchange and semantic interoperability", focuses on establishing
the foundation for common, shared meaning across the several data sources and
message and event feeds within the INFINITECH platform, while facilitating the
technical implementation of the INFINITECH principles. It comprises six tasks,
which are described thoroughly in this chapter. We further present background and
related work, as well as concepts and definitions.

Furthermore, we describe in detail 16 pilots organized in 5 categories. More
specifically, for each pilot we describe the overall objective, technological
components and services, testbed, implementation of a first Proof of Concept,
expected outcomes, datasets, data produced, explainable workflow, logical schema,
components, and conclusions (issues and barriers). During the first year of the
project, the pilots focused on use case definition, requirements identification, the
reference architecture, and the corresponding deliverables. All pilots made a great
effort to cover requests coming from different partners and work packages, working
as a whole. Communication has been crucial to organise the work and progress
properly. The effort of the project partners led to a Proof of Concept (PoC) for all
pilots (with the exceptions explained in the introduction) that summarizes
developments and achievements. The PoCs also refined the targets of each pilot and
helped the pilots to identify new requirements and anticipate possible constraints
and issues. This way, every pilot can work towards improved and more fruitful
outcomes within its cluster. The following figure provides an overview of the status
of the various pilots and illustrates that most pilots have managed to implement an
initial proof-of-concept and demonstrator:


[Figure: one status panel per pilot. The statuses shown are:

* Stakeholders Mobilized. Architecture Finalized. Implementation of Initial
Integrated PoC (the majority of panels).

* Stakeholders Mobilized. Architecture Finalized. Partial Implementation of PoC.

* Pilot Defined and Planned. Stakeholders Engagement Planned.

* Pilot still at Specification Stage due to late inclusion of key stakeholders
in the project.]


Figure 4.1. High level overview of pilots' implementation status.


The successful implementation of Proof-of-Concepts for most of the 
INFINITECH pilots provides evidence of progress and readiness for the pilots, 
while at the same time manifesting the collaborative efforts and the synergies 
between the INFINITECH partners. 

So far, because of time restrictions, pilot development has run largely in parallel
with the technological WPs and has therefore not, in general, shown the
technological match between the pilots' needs and the technologies provided by
INFINITECH. This was also because the INFINITECH technologies were still in a
first definition process. The work done to contribute to this report and to put
together all PoCs helped to break the silos between pilots and to share
technologies and architecture components. The pilots will be able to show, use and
integrate the results provided by the technological INFINITECH work packages. This
will happen once the INFINITECH technologies are more defined (deliverables in
WP3, WP4 and WP5) and the testbeds and sandboxes are ready to support this
integration.

WP6 testbeds and sandboxes will unify the way that pilots are prepared, from
an infrastructure and deployment perspective, to set the base for similar use
cases, or stakeholders with similar needs, to try INFINITECH technologies. The
Reference Architecture sets the basis for a multi-layer architecture, with
different components that can be plugged in and combined. This approach has been
followed by all pilots and it will conclude with the deployment of all the
corresponding testbeds. In this way, the Kubernetes and Docker orchestration
framework will demonstrate the multi-layer, pluggable approach advocated by
INFINITECH. Pilots have already started to work with Docker containers (some
are fully prepared) for a smooth transition and implementation of sandboxes.

This first phase can be summarized as pilots focusing on data collection and
preparation and on the deployment of a first set of components. Data capture,
filtering and homogenization (following the Reference Architecture) are already
in place and starting to work. In the coming phase, these components will be
complemented with the results of INFINITECH technologies and services (WP3-WP5)
and the creation of sandboxes (WP6) that will finalize a common way of working.
While this deliverable was being finished, a testing infrastructure was being
put in place. Pilots will have an infrastructure to manage their software
components, CI/CD tools for deployment and an environment to create and use
blueprints for their architectures. These blueprints will help the replicability
of similar scenarios and needs, e.g. how to get/inject data through a data
pipeline into a LeanXcale database for later analysis.
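As an illustration of the kind of data pipeline blueprint mentioned above, the
following is a minimal sketch of a capture, filter and homogenize chain that
ends in a storage sink. It is not the INFINITECH implementation: the record
fields ("account", "amount", "currency"), the stage functions and the
InMemorySink class are all hypothetical, and the sink merely stands in for a
database such as LeanXcale, whose client API is not shown here.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def drop_incomplete(records: Iterable[Record]) -> Iterator[Record]:
    """Filtering stage: discard records that have no amount (hypothetical rule)."""
    for record in records:
        if record.get("amount") is not None:
            yield record


def homogenize(records: Iterable[Record]) -> Iterator[Record]:
    """Homogenization stage: map records to one common, typed layout."""
    for record in records:
        yield {
            "account": str(record["account"]).strip().upper(),
            "amount": float(record["amount"]),
            # Assumption for this sketch: a missing currency defaults to EUR.
            "currency": record.get("currency", "EUR"),
        }


class InMemorySink:
    """Stand-in for the target database; it only keeps rows in a list."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def write(self, records: Iterable[Record]) -> None:
        self.rows.extend(records)


def run_pipeline(source: Iterable[Record], stages: List[Stage], sink: InMemorySink) -> None:
    """Chain the pluggable stages over the captured data and store the result."""
    stream: Iterable[Record] = source
    for stage in stages:
        stream = stage(stream)
    sink.write(stream)


if __name__ == "__main__":
    captured = [
        {"account": " ab1 ", "amount": "10.5"},
        {"account": "cd2", "amount": None},
        {"account": "ef3", "amount": 3, "currency": "USD"},
    ]
    sink = InMemorySink()
    run_pipeline(captured, [drop_incomplete, homogenize], sink)
    print(sink.rows)
```

Because each stage is a plain function over a stream of records, stages can be
added, removed or reordered without touching the others, which is the sense in
which a blueprint of this kind could be reused by pilots with similar needs.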

Cluster 1 comprises different pilots linked mainly by services and risk
assessment purposes. The overall development and deployment of the pilots is
proceeding as planned. Generally, the requirements and development phase for the
first two pilots is at an advanced stage: one has already implemented its PoC,
deployed on an on-premise cloud-based testbed, while the other has already
implemented a PoC that will be deployed on the shared testbed hosted by NOVA.
Pilot #15, instead, will be deployed in the testbed blueprint. Cluster 1 thus
provides a comprehensive view of pilot deployment by exploiting the three
different "typologies" of testbeds established by the INFINITECH project.
Overall, the development and training of machine learning applications is
proceeding, enhancing the innovative components of the pilots. To conclude,
Pilot #1 has already planned relevant stakeholder engagement activities,
demonstrating the maturity and readiness of this pilot, whereas Pilot #2 is a
clear example of how INFINITECH technologies can be exploited to develop a
trading-based risk assessment use case.

Cluster 2 comprises pilots related to Personalized Retail and Investment Banking
Services. Based on the progress so far, they are following their initial plans
(except Pilot #3, which is in the process of redefining its scope). The majority
are building the groundwork for each pilot, which mainly includes the AI-powered
tools that will be used as the basis for the final deployment. Most of the pilots
have either established their testbed or are in the process of deploying it and,
based on the relevant blueprint definition, will now start working towards the
INFINITECH way of deployment. Even though the activities reported so far focus
mainly on the technical side, the actual goal of each pilot is to provide
technologies that will improve the financial health of individuals and SMEs,
either through better and personalized investment propositions or through better
financial management tools.

At Cluster 3 level, the current implementation status of the pilots can be
summarized as generally good and promising. Each of these pilots has implemented
an initial PoC prototype and provided an initial batch of data and a testbed
installation to implement the first services and develop the technology it
needs. First data analytics components, risk calculation engines, complex search
services and user interfaces (for risk assessments) have been developed.
Synthetic financial information data are currently used in a research
environment at JSI. Blockchain technologies have started to be demonstrated to
provide more secure and trusted transaction systems, facing the difficulties of
the very high computational demand derived from these technologies: transaction
dataset preparation, huge (scalable) transaction graph analysis and
visualization tools. Pilot #7 is a case apart, because of a change in pilot
partners and the subsequent need for updated specifications and for a
redefinition of the pilot according to the new partners.

Cluster 4 is focused on the customisation of different insurance services,
exploiting real-world data collected from users through different AI-powered
technologies that evaluate the insured client's behaviour and his/her associated
potential risks. This first stage of the cluster pilots' development analysed
the different available data sources, identified which are relevant for the use
cases to be played, and built all the mechanisms needed to gather, curate and
homogenise the identified datasets. In parallel, the infrastructure to collect,
store and classify the information has been defined and implemented, aligned
with INFINITECH development and deployment guidelines, so that the second stage
can design, build and run all the novel ML models. These ML/DL models will be
based on cutting-edge AI technologies and will be specifically created to solve
the particularities of each scenario. In turn, they will be the key component of
the final new services to be offered to insurance companies and insured clients.

Cluster 5 is focused on customized and configurable insurance products based on
non-traditional data sources that are not obtained directly from the insured
subjects. The objective is to better determine the insured risk for the insured
enterprises and the agricultural sector, in order, on the one hand, to offer a
more adjusted and personalised insurance and, on the other hand, to speed up the
payment of compensation. The technologies that will be used are based on Machine
Learning and AI applied to large amounts of data obtained from sources both in
text format and in satellite images. The process is composed of three phases.
The first is the determination of the relevant sources to provide data to the
models and their homogenization for processing. The second is the management of
the data within the reference architecture established along the lines of
INFINITECH. In the third phase, ML/DL models will be based on cutting-edge AI
technologies and will be specifically created to solve the particularities of
each scenario.

Finally, most pilots are now focused on technological developments and less on
Business Processes and Stakeholders' Involvement:


* Business Process Change and Innovation: What is the system changing in
the business? How are things done today and how will they be done after
INFINITECH?

* Stakeholders' Involvement: Who is involved from the business side? Who are
the end-users and how are they involved in the pilots? Are there stakeholders'
workshops planned to evaluate the pilot systems? How many participants are
expected and when will they be scheduled? Do we need to train some users to use
the system?